Abstract


String searching consists in locating a substring in a longer text, and two strings can be 
approximately equal (various similarity measures such as the Hamming distance exist). 
Strings can be defined very broadly, and they usually contain natural language and
biological data (DNA, proteins), but they can also represent other kinds of data such as 
music or images. 

One solution to string searching is to use online algorithms which do not preprocess
the input text; however, this is often infeasible due to the massive sizes of modern data
sets. Alternatively, one can build an index, i.e. a data structure which aims to speed up
string matching queries. Indexes are divided into full-text ones, which operate on
the whole input text and can answer arbitrary queries, and keyword indexes, which store
a dictionary of individual words. In this work, we present a literature review for both
index categories as well as our contributions (which are mostly practice-oriented).

The first contribution is the FM-bloated index, which is a modification of the well-known
FM-index (a compressed, full-text index) that trades space for speed. In our approach,
the count table and the occurrence lists store information about selected q-grams in
addition to the individual characters. Two variants are described, namely one using
O(n log^2 n) bits of space with O(m + log m log log n) average query time, and one with
linear space and O(m log log n) average query time, where n is the input text length
and m is the pattern length. We experimentally show that a significant speedup can be
achieved by operating on q-grams (albeit at the cost of very high space requirements,
hence the name “bloated”).

In the category of keyword indexes we present the so-called split index, which can
efficiently solve the k-mismatches problem, especially for 1 error. Our implementation in the
C++ language is focused mostly on data compaction, which is beneficial for the search
speed (by being cache friendly). We compare our solution with other algorithms and
we show that it is faster when the Hamming distance is used. Query times in the order
of 1 microsecond were reported for one mismatch for a few-megabyte natural language
dictionary on a medium-end PC.

A minor contribution includes string sketches which aim to speed up approximate string
comparison at the cost of additional space (O(1) per string). They can be used in
the context of keyword indexes in order to deduce that two strings differ by at least k
mismatches with the use of fast bitwise operations rather than an explicit verification.

Chapter 1


Introduction 


The Bible, which consists of the Old and the New Testament, is composed of roughly 800 
thousand words (in the English language version). Literary works of such stature
were often regarded as good candidates for creating concordances, i.e. listings of words
that originated from the specific work. Such collections usually included positions of 
the words, which allowed the reader to learn about their frequency and context. Their 
assembly was a non-trivial task that required a lot of effort. Under a rather favorable 
assumption that a friar (today also referred to as a research assistant) would be able
to achieve a throughput of one word per minute, compilation (not to be confused with code
generation) for the Bible would require over thirteen thousand man-hours, or roughly one
and a half years of constant work. This naturally ignores additional efforts, for instance
printing and dissemination.

Such a listing is one of the earliest examples of a text-based data structure constructed 
with the purpose of faster searches at the cost of space and preprocessing. Luckily, today 
we are capable of building and using various structures in a much shorter time. With the 
aid of silicon, electrons, and capable human minds, we have managed to decrease the
times from years to seconds (indexing) and from seconds to microseconds (searching). 


1.1 Applications 

String searching has always been ubiquitous in everyday life, most probably since the 
very creation of the written word. In the modern world, we encounter text on a regular 
basis — on paper, glass, rubber, human skin, metal, cement, and since the 20th century 
also on electronic displays. We perform various text-based operations almost all the time, 
often subconsciously. This happens in trivial situations such as looking for interesting
news on a website on a slow Sunday afternoon, or trying to locate information in the bus
timetable on a cold Monday morning. Many familiar text-related tasks can be finished 
faster thanks to computers, and powerful machines are also crucial to scientific research. 
Specific areas are discussed in the following subsections. 

1.1.1 Natural language 

For years, the main application of computers to textual data was natural language (NL)
processing, which goes back to the work of Alan Turing in the 1950s [Tur50]. The goal was
to understand the meaning as well as the context in which the language was used. One of
the first programs that could actually comprehend and act upon English sentences was
Bobrow’s STUDENT (1967), which solved simple mathematical problems [RN10, p. 19].
The first application to text processing where string searching algorithms could really
shine was spell checking, i.e. determining whether a word is written in a correct form. It
consists in testing whether a word is present in an NL dictionary (a set of words). Such
a functionality is required since spelling errors appear relatively often due to a variety
of reasons, ranging from writer ignorance to typing and transmission errors. Research
in this area started around 1957, and the first spell checker available as an application
is believed to have appeared in 1971 [Pet80]. Today, spell checking is universal, and it
is performed by most programs which accept user input. This includes dedicated text
editors, programming tools, email clients, command-line interfaces, and web browsers.
More sophisticated approaches which try to take the context into account were also
described [CV81, Mit87], due to the fact that checking for dictionary membership is prone
to errors (e.g., mistyping “were” for “where”; Peterson [Pet86] reported that up to 16% of
errors might be undetected). Another familiar scenario is searching for words in a textual
document such as a book or an article, which allows for locating relevant fragments in a
much shorter time than by skimming through the text. Determining positions of certain
keywords in order to learn their context (neighboring words) may be also useful for
plagiarism detection (including the plagiarism of computer programs [PH89]).

With the use of approximate methods, similar words can be obtained from the NL
dictionary and correct spelling can be suggested (spelling correction is usually coupled with
spell checking). This may also include proper nouns, for example in the case of shopping
catalogs (relevant products) or geographic information systems (specific locations,
e.g., cities). Such techniques are also useful for optical character recognition (OCR),
where they serve as a verification mechanism [EL90]. Other applications are in security,
where it is desirable to check whether a password is not too close to a word from a
dictionary [MW94a], and in data cleaning, which consists in detecting errors and duplication in
data that is stored in the database [CGGM03]. String matching is also employed for
preventing the registration of fraudulent websites having similar addresses, the phenomenon
known as “typosquatting” [ME10].
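The two steps described above (a membership test followed by an approximate lookup) can be sketched as follows. This is only a minimal illustration under stated assumptions: the function names, the tiny word list, and the restriction to equal-length candidates compared with the Hamming distance are all hypothetical choices, and a linear scan over the dictionary is used instead of any index.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions; defined only for equal lengths."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def check_and_suggest(word: str, dictionary: set, k: int = 1):
    """Return (is_known, suggestions), where suggestions differ by at most k mismatches."""
    if word in dictionary:            # spell checking: exact membership
        return True, []
    candidates = [w for w in dictionary
                  if len(w) == len(word) and hamming(w, word) <= k]
    return False, sorted(candidates)  # spelling correction: close words


# Hypothetical toy dictionary.
words = {"where", "were", "there", "three", "tea", "ten"}
print(check_and_suggest("where", words))  # (True, [])
print(check_and_suggest("whert", words))  # (False, ['where'])
```

Note that "were" would pass the membership test even when "where" was intended, which is the kind of undetected error mentioned in the text.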

It may happen that the pattern that is searched for is not explicitly specified, as is the
case when we use a web search engine (i.e. we would like to find the entire website, but we
specify only a few keywords), which is an example of information retrieval. For instance,
index-based methods form an important component of the architecture of the Google
engine [BP98].

1.1.2 Bioinformatics 

The biological data is commonly represented in a textual form, and for this reason it can
be searched just like any other text. Most popular representations include:

• DNA: an alphabet of four symbols corresponding to the nucleobases A, C, G, and
T. It can be extended with an additional character N, indicating that there might
be any nucleobase at a specified position. This is used, for instance, when the
sequencing method could not determine the nucleobase with a desired certainty.
Sometimes, additional information such as the quality of the read, i.e. the
probability that a specific base was determined correctly, is also stored.

• RNA: four nucleobases, namely A, C, G, and U. Similarly to the DNA, additional
information may be present.

• Proteins: 20 symbols corresponding to different amino acids (uppercase letters
from the English alphabet), with 2 additional symbols for amino acids occurring
only in some species (O and U), and 4 placeholders (B, J, X, Z) for situations where
the amino acid is ambiguous. All 26 letters from the English alphabet are used.

Computational information was an integral part of the field of bioinformatics from the
very beginning, and at the end of the 1970s there was a substantial activity in the
development of string (sequence) alignment algorithms (e.g., for RNA structure
prediction) [OV03]. Alignment methods allow for finding evolutionary relationships between
genes and proteins and thus constructing phylogenetic trees. Sequence similarity in proteins
is important because it may imply structural as well as functional similarity. Researchers
use tools such as BLAST [AGM+90], which try to match the string in question with
similar ones from the database (of proteins or genomes). Approximate methods play an
important role here, because related sequences often differ from one another due to
mutations in the genetic material. These include point mutations, that is changes at a single
position, as well as insertions and deletions (usually called indels).


Another research area that would not thrive without computers is genome sequencing.
This is caused by the fact that sequencing methods cannot read the whole genome, but
rather they produce hundreds of gigabytes of strings (DNA reads, whose typical length
is from tens to a thousand base pairs [LLL+12]) whose exact positions in the genome
are not known. Moreover, the reads often contain mistakes due to the imperfection
of the sequencing itself. The goal of the computers is to calculate the correct order
using complicated statistical and string-based tools, with or without a reference genome
(the latter being called de novo sequencing). This process is well illustrated by its
name, shotgun sequencing, and it can be likened to shredding a piece of paper and
reconstructing it from the pieces. String searching is crucial here because it allows for finding
repeated occurrences of certain patterns [LTP+09, SD10].


1.1.3 Other 

Other data can be also represented and manipulated in a textual form. This includes
music, where we would like to locate a specific melody, especially using approximate methods
which account for slight variations or imperfections (e.g., singing out of pitch) [Gra11,
p. 77]. Another field where approximate methods play a crucial role is signal processing,
especially in the case of audio signals, which can be processed by speech recognition
algorithms. Such a functionality is becoming more and more popular nowadays, due
to the evolution of multimedia databases containing audiovisual data [Nav01]. String
algorithms can be also used in intrusion detection systems, where their goal is to identify
malicious activities by matching data such as system state graphs, instruction sequences,
or packets with those from the database [KS94, TSCV04]. String searching can be also
applied for the detection of arbitrary two-dimensional shapes in images [BB93], and yet
another application is in compression algorithms, where it is desirable to find repetitive
patterns in a similar way to sequence searching in biological data. Due to the fact that
almost any data can be represented in a textual form, many other application areas exist;
see, e.g., Navarro [Nav01] for more information.

This diversity of data causes the string algorithms to be used in very different scenarios.
The pattern size can vary from a few letters (NL words) to a few hundred (DNA reads),
and the input text can be of almost arbitrary size. For instance, Google reported in 2015
that their web search index has reached over 100 thousand terabytes (10^17 bytes) [goo].
Massive data is also present in bioinformatics, where the size of the genome of a single
organism is often measured in gigabytes; some of the largest animal genomes belong to the
lungfish and the salamander, each occupying approximately 120 Gbp [Gre02] (i.e. roughly
28 GB, assuming that each base is coded with 2 bits). As regards proteins, the UniProt
protein database stores approximately 50 million sequences (each composed of roughly
a few hundred symbols) in 2015 and continues to grow exponentially [uni]. It was
remarked recently that biological textual databases grow more quickly than the ability to
understand them [BG15].
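The size estimates quoted above can be verified with a few lines of arithmetic. The sketch below assumes that a terabyte means 10^12 bytes for the search index figure and that a gigabyte means 2^30 bytes for the genome figure (the binary convention); the variable names are illustrative only.

```python
# 120 Gbp genome, 2 bits per base (A, C, G, T only).
bases = 120 * 10**9
size_bytes = bases * 2 // 8            # 8 bits per byte
size_gb = size_bytes / 2**30           # binary gigabytes (assumption)
print(size_bytes)                      # 30000000000
print(round(size_gb, 1))               # 27.9, i.e. roughly 28 GB

# 100 thousand terabytes, with 1 TB taken as 10**12 bytes (assumption).
index_bytes = 100_000 * 10**12
print(index_bytes == 10**17)           # True
```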

When it comes to data of such magnitude, it is feasible only to perform an index-based 
search (meaning that the data is preprocessed), which is the main focus of this thesis. It 
seems most likely that the data sizes will continue to grow, and for this reason there is 
a clear need for the development of algorithms which are efficient in practice. 


1.2 Preliminaries 

This section presents an overview of data structures and algorithms which act as building 
blocks for the ones presented later, and it introduces the necessary terminology. String 
searching, which is the main topic of this thesis, is described in the following chapter. 

Throughout this thesis, data structures are usually approached from two angles: a
theoretical one, which concentrates on the worst-case space and query time, and a practical one. The
latter focuses on performance in real-world scenarios, and it is often heuristically oriented
and focused on cache utilization and reducing slow RAM access. It is worth noting that
state-of-the-art theoretical algorithms sometimes perform very poorly in practice because
of certain constant factors which are ignored in the analysis. Moreover, they might not
even be tested or implemented at all. On the other hand, a practical evaluation depends
heavily on the hardware (peculiarities of the CPU cache, instruction prefetching, etc.),
properties of the data sets used as input, and most importantly on the implementation.
Moffat and Gog [MG14] provided an extensive analysis of experimentation in the field
of string searching, and they pointed out various caveats. These include for instance a
bias towards certain repetitive patterns when the patterns are randomly sampled from
the input text, or the advantage of smaller data sets which increase the probability that
(at least some of) the data would fit into the cache.

The theoretical analysis of the algorithms is based on the big O family of asymptotic
notations, including O, Ω, Θ, and the relevant lower case counterparts (we assume that the
reader is familiar with these tools and with complexity classes). Unless stated otherwise,
the complexity analysis refers to the worst-case scenario, and all logarithms are assumed
to be base 2 (this might be also stated explicitly as log_2). When we state that the
complexity or the average or the worst case is equal to some value, we mean the running
time of the algorithm. On the other hand, if the time or space is explicitly mentioned,
the word “complexity” might be omitted. Array, string, and vector indexes are always
0-based and they are assumed to be contiguous, and collection indexes are 1-based (e.g., a
collection of strings s_1, ..., s_n). We consider a standard hierarchical memory model with
RAM and a faster CPU cache, and we take for granted that the data always fits into the
main memory, i.e. disk input/output (I/O) is ignored. Moreover, we assume that the size
of the data does not exceed 2^32 bytes, which means that it is sufficient for each pointer
or counter to occupy 32 bits (4 bytes). Sizes that are given in kilobytes and megabytes
are indicated with abbreviations KB and MB, which refer to standard computer science
quantities, i.e. 2^10 (rather than 1,000) and 2^20 bytes, respectively.


1.2.1 Sorting 

Sorting consists in ordering n elements from a given set S in such a way that the following
holds: ∀i ∈ [0, n − 1) : S[i] ≤ S[i + 1], that is the smallest element is always in front.
In reverse sorting, the highest element is in front, and the inequality sign is reversed.
Popular sorting methods include the heapsort and the mergesort with O(n log n)
worst-case time guarantees. Another well-known algorithm is the quicksort with average time
O(n log n) (although the worst case is equal to O(n^2)), which is known to be 2-3 times
faster in practice than both heapsort and mergesort [Ski98]. There also exist algorithms
linear in n which can be used in certain scenarios, for instance the radix sort for integers
with time complexity O(wn) (or O(n(w/log n)) for a byte-wide radix), where w is the
machine word size.
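As an illustration of the radix sort mentioned above, the following is a minimal sketch of the least-significant-digit variant with a byte-wide radix. It assumes non-negative integers that fit into `word_bits` bits (32 by default, in line with the pointer size assumed earlier); the function name and the bucket-list representation are illustrative choices and not a tuned implementation.

```python
def radix_sort(values, word_bits=32):
    """LSD radix sort for non-negative integers below 2**word_bits."""
    out = list(values)
    for shift in range(0, word_bits, 8):           # one pass per byte
        buckets = [[] for _ in range(256)]         # byte-wide radix
        for v in out:
            buckets[(v >> shift) & 0xFF].append(v) # stable distribution
        out = [v for bucket in buckets for v in bucket]
    return out


print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))
# [2, 24, 45, 66, 75, 90, 170, 802]
```

Each of the w/8 passes is linear in n, and the stability of every pass is what makes the final order correct.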

When it comes to sorting n strings of average length m, a comparison sorting method
would take O(nm log n) time (assuming that comparing two strings is linear in time).
Alternatively, we could obtain an O(nm) time bound by sorting each letter column with
a sorting method which is linear for a fixed alphabet (essentially performing a radix
sort), e.g., using a counting sort. Moreover, we can even achieve an O(n) complexity by
building a trie with lexicographically ordered children at each level and performing a
pre-order, depth-first search (DFS) (see the following subsections for details). When it comes
to suffix sorting (i.e. sorting all suffixes of the input text), dedicated methods which do
not have a linear time guarantee are often used due to reduced space requirements or
good practical performance [MF04, PST07]. Recently, linear methods which are efficient
in practice have also been described [Non13].
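The column-wise approach mentioned above can be sketched as follows. This is a simplified illustration with several assumptions: strings shorter than the current column are treated as if padded with a symbol smaller than any character, and the buckets are kept in a dictionary whose keys are sorted in every pass (constant work for a fixed alphabet, but not a true counting sort). The function name is hypothetical.

```python
def lsd_string_sort(strings):
    """Sort strings by stable bucketing of each character column, last column first."""
    width = max((len(s) for s in strings), default=0)
    out = list(strings)
    for col in range(width - 1, -1, -1):
        buckets = {}
        for s in out:
            # Key 0 is reserved for strings that have no character in this column.
            key = ord(s[col]) + 1 if col < len(s) else 0
            buckets.setdefault(key, []).append(s)   # stable within a bucket
        out = [s for key in sorted(buckets) for s in buckets[key]]
    return out


print(lsd_string_sort(["ten", "tea", "to", "inn", "in", "A", "ted"]))
# ['A', 'in', 'inn', 'tea', 'ted', 'ten', 'to']
```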
1.2.2 Trees

A tree contains multiple nodes that are connected with each other, with one (top-most) 
node designated as the root. Every node contains zero or more children, and a tree is an 
undirected graph where any two vertexes are connected by exactly one path (there are no 
cycles). Further terminology which is relevant to trees is as follows [CLRS09, Sec. B.5]:

• A parent is a neighbor of a child and it is located closer to the root (and vice 
versa). 

• A sibling is a node which shares the same parent. 

• Leaves are the nodes without children and in the graphical representation they 
are always shown at the bottom of the diagram. The leaves are also called external 
nodes, and internal nodes are all nodes other than leaves. 

• Descendants are the nodes located anywhere in the subtree rooted by the current 
node, and ancestors are the nodes anywhere on the path from the root (inclusive) 
to the current node. Proper descendants and ancestors exclude the current node. 
If n_1 is an ancestor of n_2, then n_2 is a descendant of n_1, and vice versa.

• The depth of a node is the length of the path from this node to the root. 

• The height of a tree is the length of the longest path from the root to any leaf (i.e.
the depth of the deepest node).

The maximum number of children can be limited for each node. Many structures are
binary trees, which means that every node has at most two children, and a generic term
is a k-ary tree (or multiary for k > 2). A full (complete) k-ary tree is a structure where
every node has exactly 0 (in the case of leaves) or k (in the case of internal nodes)
children. A perfect tree is a tree where all leaves have the same depth. A historical note:
apparently, a binary tree used to be called a bifurcating arborescence in the early years
of computer science [Knu97, p. 363]. A balanced (height-balanced, self-balancing) tree
is a tree whose height is maintained with respect to its total size irrespective of possible
updates and deletions. The height of a balanced binary tree is logarithmic, i.e. O(log n)
(log_k n for a k-ary tree). It is often desirable to maintain such a balance because otherwise
a tree may lose its properties (e.g., worst-case search complexity). This is caused by the
fact that the time complexity of various algorithms is proportional to the height of the
tree.

There exist many kinds of trees, and they are characterized by some additional properties
which make them useful for a certain purpose.

1.2.2.1 Binary search tree

The binary search tree (BST) is used for determining a membership in the set or for
storing key-value pairs. Every node stores one value V; the value of its right child is
always bigger than V, and the value of its left child is always smaller than V. The lookup
operation consists in traversing the tree towards the leaves until either the value is found
or there are no more nodes to process, which indicates that the value is not present. The
BST is often used to maintain a collection of numbers; however, the values can also be
strings (they are ordered alphabetically), see Figure 1.1. It is crucial that the BST is
balanced. Otherwise, in the scenario where every node had exactly one child, its height
would be linear (basically forming a linked list) and thus the complexity for the traversal
would degrade from O(log n) to O(n). The occupied space is clearly linear (there is one
node per value) and the preprocessing takes O(n log n) time because each insertion costs
O(log n).
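The lookup and the effect of a missing balance can be illustrated with a short sketch. Note that this is a plain, unbalanced BST (no rebalancing is performed, which is an intentional simplification), and the names `Node`, `insert`, `contains`, `height`, and `build` are illustrative only.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None


def insert(root, value):
    """Insert value into a plain (unbalanced) BST and return the root."""
    if root is None:
        return Node(value)
    node = root
    while value != node.value:
        if value < node.value:
            if node.left is None:
                node.left = Node(value)
            node = node.left
        else:
            if node.right is None:
                node.right = Node(value)
            node = node.right
    return root


def contains(root, value):
    """Walk towards the leaves until the value is found or no nodes remain."""
    node = root
    while node is not None:
        if value == node.value:
            return True
        node = node.left if value < node.value else node.right
    return False


def height(root):
    """Number of edges on the longest root-to-leaf path (-1 for an empty tree)."""
    if root is None:
        return -1
    return 1 + max(height(root.left), height(root.right))


def build(values):
    root = None
    for v in values:
        root = insert(root, v)
    return root


mixed = build(["ted", "in", "to", "A", "inn", "ten", "tea"])
chain = build(["A", "in", "inn", "tea", "ted", "ten", "to"])  # sorted input
print(contains(mixed, "tea"), contains(mixed, "tee"))  # True False
print(height(mixed), height(chain))                    # 3 6
```

Inserting the values in sorted order produces the degenerate, list-like tree described above, whose height equals the number of values minus one.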



Figure 1.1: A binary search tree (BST) storing strings from the English alphabet.
The value of the right child is always bigger than the value of the parent, and the value 
of the left child is always smaller than the value of the parent. 


1.2.2.2 Trie 

The trie (digital tree) [Fre60] is a tree in which the position of a node (more specifically, a
path from the root to the node) describes the associated value, see Figure 1.2. The nodes
often store IDs or flags which indicate whether a given node has a word, which is required
because some nodes may be only intermediary and not associated with any value. The
values are often strings, and the paths may correspond to the prefixes of the input text. A
trie supports basic operations such as searching, insertion, and deletion. For the lookup,
we check whether each consecutive character from the query is present in the trie while
moving towards the leaves, hence the search complexity is directly proportional to the
length of the pattern. In order to build a trie, we have to perform a full lookup for each
word, thus the preprocessing complexity is equal to O(n) for words of total length n.
The space is linear because there is at most one node per input character.
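A minimal sketch of the trie operations described above is given below, using the word set from Figure 1.2. The representation is an assumption made for brevity: children are stored in a Python dictionary and a `word_id` field plays the role of the ID or flag, so the names and layout are illustrative only. The last function shows the pre-order traversal with lexicographically ordered children that was mentioned in the context of sorting.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.word_id = None   # set only if a word ends at this node


def trie_insert(root, word, word_id):
    node = root
    for ch in word:
        if ch not in node.children:
            node.children[ch] = TrieNode()
        node = node.children[ch]
    node.word_id = word_id


def trie_lookup(root, word):
    """Return the ID of the word, or None if it is absent (time linear in len(word))."""
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return None
    return node.word_id


def trie_sorted_words(root):
    """Pre-order DFS which visits the children in lexicographic order."""
    out, stack = [], [("", root)]
    while stack:
        prefix, node = stack.pop()
        if node.word_id is not None:
            out.append(prefix)
        for ch in sorted(node.children, reverse=True):  # smallest is popped first
            stack.append((prefix + ch, node.children[ch]))
    return out


root = TrieNode()
for i, w in enumerate(["to", "tea", "A", "inn", "ten", "in", "ted"]):
    trie_insert(root, w, i)

print(trie_lookup(root, "inn"))   # 3
print(trie_lookup(root, "te"))    # None (intermediary node only)
print(trie_sorted_words(root))    # ['A', 'in', 'inn', 'tea', 'ted', 'ten', 'to']
```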



Figure 1.2: A trie, which is one of the basic structures used in string searching,
constructed for strings from the set {A, in, inn, tea, ted, ten, to}. Each edge corresponds
to one character, and the strings are stored implicitly (here shown for clarity).
Additional information such as IDs (here shown inside the parentheses) is sometimes kept
in the nodes.

Various modifications of the regular trie exist; an example could be the Patricia
trie [Mor68], whose aim is to reduce the occupied space. The idea is to merge every
node which has no siblings with its parent, thus reducing the total number of nodes,
and resulting edge labels include the characters from all edges that were merged (the
complexities are unchanged).

1.2.3 Hashing 

A hash function H transforms data of arbitrary size into the data of fixed size. Typical
output sizes include 64, 128, 256, and 512 bits. The input can be in principle of any
type, although hash functions are usually designed so that they work well for a particular
kind of data, e.g., for strings or for integers. Hash functions often have certain desirable
properties such as a limited number of collisions, where for two distinct chunks of data d_1 and d_2,
the probability that H(d_1) = H(d_2) should be relatively low (e.g., H is called universal
if Pr(H(d_1) = H(d_2)) ≤ 1/n for an n-element hash table [CW77]). There exists a group
of cryptographic hash functions, which offer certain guarantees regarding the number
of collisions. They also provide non-reversibility, which means that it is hard (in the
mathematical sense, for example the problem may be NP-hard) to deduce the value of
the input string from the hash value. Such properties are provided at the price of reduced
speed, and for this reason cryptographic hash functions are usually not used for string
matching.


A perfect hash function guarantees no collisions, e.g., the two-level FKS scheme with O(n)
space [FKS84]. All keys usually have to be known beforehand, although dynamic perfect
hashing was also considered [DKM+94]. A minimal perfect hash function (MPHF)
uses every bucket in the hash table, i.e. there is one value per bucket (the lower space
bound for describing an MPHF is equal to roughly 1.44n bits for n elements [BBD09]).
The complexity of a hash function is usually linear in the input length, although it is
sometimes assumed that it takes constant time.


A hash function is an integral part of a hash table Ht, which is a data structure that
associates the values with buckets based on the key, i.e. the hash of the value. This can
be represented with the following relation: Ht[H(v)] = v for any value v. Hash tables
are often used in string searching because they allow for quick membership queries, see
Figure 1.3. The size of the hash table is usually much smaller than the number of
all possible hash values, and it is often the case that a collision occurs (the same key
is produced for two different values). There exist various methods of resolving such
collisions, and the most popular ones are as follows:


• Chaining — each bucket holds a list of all values which hashed to this bucket. 

• Probing — if a collision occurs, the value is inserted into the next unoccupied 
bucket. This may be linear probing, where the consecutive buckets are scanned 
linearly until an empty bucket is found, or quadratic probing, where the gaps 
between consecutive buckets are formed by the results of a quadratic polynomial. 

• Double hashing: gaps between consecutive buckets are determined by another
hash function. A simple approach could be for instance to locate the bucket
index i for the j-th probing attempt using the formula i = (H_1(v) + j · H_2(v)) mod |Ht|
for any two hash functions H_1 and H_2.


In order to resolve the collisions, the keys usually have to be stored as well. The
techniques which try to locate an empty bucket (as opposed to chaining) are referred to as
open addressing. A key characteristic of the hash table is its load factor (LF), which
is defined as the number of entries divided by the number of buckets. Let us note that
0 ≤ LF ≤ 1.0 for open addressing (the performance degrades rapidly as LF → 1.0);
however, in the case of chaining it holds that 0 ≤ LF ≤ n for n entries.

Figure 1.3: A hash table for strings (the keys “John Smith”, “Lisa Smith”, and
“Sandra Dee” are mapped by the hash function to buckets); reproduced from Wikimedia Commons¹.
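Open addressing with linear probing and the role of the load factor can be sketched as follows. This is a toy illustration under several assumptions: Python's built-in `hash` stands in for H, the keys themselves are stored in the buckets, deletion is not supported, and the threshold of 0.75 at which the table is doubled is an arbitrary choice. The class and method names are hypothetical.

```python
class LinearProbingSet:
    """A set of keys stored in a hash table with linear probing (no deletion)."""

    MAX_LOAD = 0.75  # arbitrary threshold that keeps LF well below 1.0

    def __init__(self, n_buckets=8):
        self.buckets = [None] * n_buckets
        self.count = 0

    def _probe(self, key):
        # Scan consecutive buckets until the key or an empty bucket is found.
        # Termination is guaranteed because LF < 1, so an empty bucket exists.
        i = hash(key) % len(self.buckets)
        while self.buckets[i] is not None and self.buckets[i] != key:
            i = (i + 1) % len(self.buckets)
        return i

    def add(self, key):
        if self.count + 1 > self.MAX_LOAD * len(self.buckets):
            self._grow()
        i = self._probe(key)
        if self.buckets[i] is None:
            self.buckets[i] = key
            self.count += 1

    def __contains__(self, key):
        return self.buckets[self._probe(key)] is not None

    def load_factor(self):
        return self.count / len(self.buckets)

    def _grow(self):
        old = [k for k in self.buckets if k is not None]
        self.buckets = [None] * (2 * len(self.buckets))
        for k in old:                      # rehash every key into the new table
            self.buckets[self._probe(k)] = k


names = LinearProbingSet()
for name in ["John Smith", "Lisa Smith", "Sandra Dee"]:
    names.add(name)
print("Lisa Smith" in names, "Sam Doe" in names)  # True False
print(names.load_factor())                        # 0.375
```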

1.2.4 Data structure comparison 


In the previous subsections, we introduced data structures which are used by more
sophisticated algorithms (described in the following chapters). Still, they can be also used
on their own for exact string searching. In Table 1.1 we present a comparison of their
complexities together with a linear, direct access array. It is to be noted that even though
the worst case of a hash table lookup is linear (iterating over one bucket which stores all
the elements), it is extremely unlikely, and any popular hash function offers reasonable
guarantees against building such a degenerate hash table.


Data structure    Lookup                       Preprocessing    Space
Array             O(n)                         O(1)             O(n)
Balanced BST      O(log n)                     O(n log n)       O(n)
Hash table        O(m) avg, O(n) worst-case    O(n)             O(n)
Trie              O(m)                         O(n)             O(n)

Table 1.1: A comparison of the complexities of basic data structures which can be
used for exact string searching. Here, we assume that string comparison takes constant
time.


1.2.5 Compression 

Compression consists in representing the data in an alternative (encoded) form with
the purpose of reducing the size. After compression, the data can be decompressed
(decoded) in order to obtain the original representation. Typical applications include
reducing storage sizes and saving bandwidth during transmission. Compression can be
either lossless or lossy, depending on whether the result of decompression matches the
original data. The latter is useful especially when it comes to multimedia (it is frequently
used by domain-specific methods such as those based on human perception of images),
where the lower quality may be acceptable or even indiscernible, and storing the data in
an uncompressed form is often infeasible. For instance, the original size of a 2-hour Full
HD movie with 32 bits per pixel and 24 frames per second would amount to more than
one terabyte. Data that can be compressed is sometimes called redundant.

¹ Jorge Stolfi, available at http://en.wikipedia.org/wiki/File:Hash_table_3_1_1_0_1_0_0_SP.svg, CC A-SA 3.0.

One of the most popular compression methods is character substitution, where the
selected symbols (bit q-grams) are replaced with ones that take less space. A classic
algorithm is called Huffman coding [Huf52], and it offers an optimal substitution method.
Based on q-gram frequencies, it produces a codebook which maps more frequent
characters to shorter codes, in such a way that every code is uniquely decodable (e.g., 00 and
1 are uniquely decodable, but 00 and 0 are not). For real-world data, Huffman coding
offers compression rates that are close to the entropy (see the following subsection), and
it is often used as a component of more complex algorithms. We refer the reader to
Salomon’s monograph [Sal04] for more information on data compression.
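The construction of a Huffman codebook can be sketched in a few lines with a priority queue. This is a simplified illustration which assumes that the symbols are single characters (rather than arbitrary bit q-grams) and which keeps partial codebooks in dictionaries instead of building an explicit tree; the function name is hypothetical.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Map each character of text to a prefix-free binary code (as a string of 0s and 1s)."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:                       # a single symbol still needs one bit
        return {next(iter(freq)): "0"}
    # Heap items: (weight, unique tie-breaker, partial codebook).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)      # the two least frequent subtrees
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


text = "abracadabra"
codes = huffman_codes(text)
encoded = "".join(codes[ch] for ch in text)
print(len(codes["a"]))   # 1 (the most frequent symbol gets the shortest code)
print(len(encoded))      # 23 bits, compared with 88 bits at 8 bits per character
```

The exact bit patterns depend on how ties are broken, but the total encoded length is the same for every valid Huffman codebook of a given input.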


1.2.5.1 Entropy 

We can easily determine the compression ratio by taking the size (the number of occupied
bits) of the original data and dividing it by the size of the compressed data, r = n/n_c.
It may seem that r ≥ 1 should always hold; however, certain algorithms might
actually increase the data size after compressing it when operating on an inconvenient
data set (which is of course highly undesirable). A related problem is how to determine
the "compressibility" of the data, i.e. the optimal compression ratio (the highest r). This
brings us to the notion of entropy, sometimes also called Shannon's entropy after the
name of the author [Sha48]. It describes the amount of information which is contained
in a message, and in the case of strings it determines the average number of bits that
is required in order to encode an input symbol under a specified alphabet and frequency
distribution. This means that the entropy describes a theoretical bound on data
compression (one that cannot be exceeded by any algorithm). Higher entropy means that
it is more difficult to compress the data (e.g., when multiple symbols appear with equal
frequency). The formula is presented in Figure 1.4.

$$E = -K \sum_{i=1}^{n} p_i \log p_i$$

Figure 1.4: The formula for Shannon's entropy, where E is the entropy function, p_i
is the probability that symbol i occurs, and K is a positive constant.
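
As a small illustration of the formula from Figure 1.4, the following Python sketch estimates the
entropy of a string from its empirical symbol frequencies. It assumes K = 1 and a base-2
logarithm, so that the result is expressed in bits per symbol; the function name shannon_entropy
is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Empirical entropy of `text` in bits per symbol (K = 1, logarithm base 2)."""
    n = len(text)
    counts = Counter(text)
    # p_i is estimated as the relative frequency of symbol i in the text.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

A string consisting of one repeated symbol has the entropy of 0 bits, whereas four equally
frequent symbols yield 2 bits per symbol, which matches the remark that equal frequencies make
the data harder to compress.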
A variation of entropy which is used in the context of strings is called the k-th order
entropy. It takes the context of k preceding symbols into account and it allows for the
use of different codes based on this context (e.g., ignoring a symbol c_2 which always
appears after the symbol c_1). Shannon's entropy corresponds to the case of k = 0
(denoted as H_0, or H_k in general). When we increase the k value, we also increase the
theoretical bound on compressibility, although the size of the data required for storing
context information may at some point dominate the space ICloRlll .
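
The effect of the context can be sketched in Python as follows. The code assumes the commonly
used empirical definition, in which H_k is the weighted average of the zeroth-order entropies of
the symbols that follow each distinct context of length k; this exact definition and the function
names are assumptions and they are not taken from the text above.

```python
import math
from collections import Counter, defaultdict


def zeroth_order_entropy(symbols):
    """Empirical H_0 of a sequence, in bits per symbol."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())


def kth_order_entropy(text, k):
    """Empirical H_k: average of H_0 over the followers of each k-symbol context."""
    if k == 0:
        return zeroth_order_entropy(text)
    followers = defaultdict(list)
    for i in range(k, len(text)):
        followers[text[i - k:i]].append(text[i])  # symbol seen after this context
    total = sum(len(f) * zeroth_order_entropy(f) for f in followers.values())
    return total / len(text)
```

For the string abababab we obtain H_0 = 1 bit, but H_1 = 0, because each symbol is fully
determined by its predecessor.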

1.2.6 Pigeonhole principle 

Let us consider a situation where we have x buckets and n items which are to be
positioned inside those buckets. The pigeonhole principle (often also called the Dirichlet
principle) states that if n > x, then at least one of the buckets must store more than one
item. The name comes from an intuitive representation of the buckets as boxes and
items as pigeons. Despite its simplicity, this principle has been successfully applied to
various mathematical problems. It is also often used in computer science, for example to
describe the number of collisions in a hash table. Later we will see that the pigeonhole
principle is also useful in the context of string searching, especially when it comes to
string partitioning and approximate matching.
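
One common way of applying the principle to strings can be sketched as follows: when a pattern is
split into k + 1 pieces and compared with a string that differs from it at no more than k
positions, the mismatches cannot touch every piece, so at least one piece must occur unchanged.
The Python code below is only an illustration under this assumption (equal lengths, mismatches
only), and the helper names are hypothetical.

```python
def split_into_pieces(pattern, pieces):
    """Split `pattern` into contiguous parts of near-equal length.

    Returns a list of (start_offset, piece) pairs.
    """
    base, extra = divmod(len(pattern), pieces)
    parts, start = [], 0
    for j in range(pieces):
        length = base + (1 if j < extra else 0)
        parts.append((start, pattern[start:start + length]))
        start += length
    return parts


def has_intact_piece(pattern, candidate, k):
    """Check whether one of the k + 1 pieces of `pattern` occurs unchanged in `candidate`."""
    if len(pattern) != len(candidate):
        raise ValueError("both strings must have the same length")
    return any(candidate[s:s + len(p)] == p
               for s, p in split_into_pieces(pattern, k + 1))
```

For instance, searching differs from seerchang at 2 positions; with k = 2 the pattern is split
into sea, rch, and ing, and the piece rch is preserved.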


1.3 Overview 

This thesis is organized as follows:

• Chapter 2 provides an overview of the field of string searching, deals with the
underlying theory, introduces relevant notations, and discusses related work in the
context of online search algorithms.

• Chapter 3 includes related work and discusses current state-of-the-art algorithms
for full-text indexing as well as our contribution to this area.

• Chapter 4 does the same for keyword indexes.

• Chapter 5 describes the experimental setup and presents practical results.

• Chapter 6 contains conclusions and pointers to possible future work.

• Appendix A offers information regarding the data sets which were used for the
experimental evaluation.

• Appendix B discusses the complexity of exact string comparison.

• Appendix C discusses the q-gram-based compression of the split index
(Section [A2^) in detail.

• Appendix D contains experimental results for string sketches (Section 4.4) when
used for the alphabet with uniform letter frequencies.

• Appendix E presents the frequencies of English alphabet letters.

• Appendix F contains Internet addresses where the reader can obtain the code for
hash functions which were used to obtain experimental results for the split index
(Section 5.2).





Chapter 2 


String Searching 


This thesis deals with strings, which are sequences of symbols over a specified alphabet.
The string is usually denoted as S, the alphabet as Σ, and the length (size) as n or |S|
for strings, and σ or |Σ| for the alphabet. An arbitrary string S is sometimes called
a word (which is not to be confused with the machine word, i.e. a basic data unit in
the processor), and it is defined over a given alphabet Σ, that is S belongs to the set
of all words specified over the said alphabet, S ∈ Σ*. Both strings and alphabets are
assumed to be finite and well-defined, and alphabets are totally ordered. A string with a
specified value is written with the teletype font, as in abcd. The brackets are usually used
to indicate the character at a specified position and the index is 0-based, for instance
if string S = text, then S[1] = e. A substring (sometimes referred to as a factor) is
written as S[i0, i1] (an inclusive range); for the previous example, S[1, 2] = ex, and a
single character is a substring of length 1 (usually denoted with c). The last character is
indicated with −1 and P = P[0, −1]. S1 ⊂ S2 indicates that the string S1 is a substring
of S2 (conversely, S1 ⊄ S2 indicates that S1 is not a substring of S2). The subscripts
are usually used to distinguish multiple strings, and two strings may be concatenated
(merged into one), recorded as S = S1S2 or S = S1 + S2, in which case |S| = |S1| + |S2|.
Removing one substring from another is indicated with the subtraction sign, S = S1 − S2,
provided that S2 ⊂ S1, and as a result |S| = |S1| − occ · |S2| for occ occurrences of S2
in S1. The equality sign indicates that the strings match exactly, which means that the
following relation always holds: S1 = S2 → |S1| = |S2|.

String searching, or string matching, refers to locating a substring (a pattern, P, or a
query, Q, with length m) in a longer text T. The textual data T that is searched is
called the input (input string, input text, text, database), and its length is denoted by n
(e.g., O(n) indicates that the complexity is linear with respect to the size of the original
data). The pattern is usually much smaller than the input, often by multiple orders of
magnitude (m ≪ n). Based on the position of the pattern in the text, we write that P
occurs in T with a shift s, i.e. P[0, m − 1] = T[s, s + m − 1] ∧ s ≤ n − m. As mentioned
before, the applications may vary (see Section 1.1), and the data itself can come from
many different domains. Still, string searching algorithms can operate on any text while
being oblivious to the actual meaning of the data. The field concerning the algorithms
for string processing is sometimes called stringology.


Two important notions are prefixes and suffixes, where the former is a substring S[0, i]
and the latter is a substring S[i, |S| − 1] for any 0 ≤ i < |S|. Let us observe at this
point that every substring is a prefix of one of the suffixes of the original string, as well
as a suffix of one of the prefixes. This simple statement is a basis for many algorithms
which are described in the following chapters. A proper prefix or suffix is not equal to
the string itself. The strings can be lexicographically ordered, which means that they are
sorted according to the ordering of the characters from the given alphabet (e.g., for the
English alphabet, letter a comes before b, b comes before c, etc). Formally, S1 < S2 for
two strings of respective lengths n1 and n2, if ∃s : ∀i ∈ [0, s) : S1[i] = S2[i] ∧ S1[s] <
S2[s] ∧ 0 ≤ s < min(n1, n2), or n1 < n2 ∧ S1 = S2[0, n1 − 1] [CLRS09, p. 304]. When
it comes to strings, we often mention q-grams and k-mers, which are lists of contiguous
characters (strings or substrings). The former is usually used in general terms and the
latter is used for biological data (especially DNA reads). A 5-gram or 5-mer is a q-gram
of length 5.
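
The notions introduced above can be checked with a few lines of Python. This is only an
illustrative sketch with hypothetical helper names; Python's built-in string comparison already
follows the lexicographic order defined above (for the ordering of character codes).

```python
def prefixes(s):
    """All prefixes S[0, i] for 0 <= i < |S|."""
    return [s[:i + 1] for i in range(len(s))]


def suffixes(s):
    """All suffixes S[i, |S| - 1] for 0 <= i < |S|."""
    return [s[i:] for i in range(len(s))]


def substrings(s):
    """The set of all non-empty substrings of s."""
    return {s[i:j + 1] for i in range(len(s)) for j in range(i, len(s))}


def qgrams(s, q):
    """All q-grams of s, i.e. substrings of length q, in the order of occurrence."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def is_prefix_of_some_suffix(sub, s):
    return any(suffix.startswith(sub) for suffix in suffixes(s))
```

For S = text, the 2-grams are te, ex, and xt, and every element of substrings(S) passes the
is_prefix_of_some_suffix check, which is the observation used by many suffix-based indexes.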


2.1 Problem classification 

The match between the pattern and the substring of the input text is determined
according to the specified similarity measure, which allows us to divide the algorithms into
two categories: exact and approximate. The former refers to direct matching, where the
length as well as all characters at the corresponding positions must be equal to one
another. This relation can be represented formally for two strings S1 and S2 of the same
length n, which are equal if and only if ∀i ∈ [0, n) : S1[i] = S2[i] (or simply, S1 = S2). In
the case of approximate matching, the similarity is measured with a specified distance,
also called an error metric, between the two strings. It is to be noted that the word
approximation is not used here strictly in the mathematical sense, since approximate search
is actually harder than the exact one when it comes to strings [Nav01]. In general, given
two strings S1 and S2, the distance D(S1, S2) is the minimum cost of edit operations
that would transform S1 into S2 or vice versa. The edits are usually defined as a finite,
well-defined set of rules E = {r : r(S) = S'}, and each rule can be associated with a
different cost. When error metrics are used, the results of string matching are limited to
those substrings which are close to the pattern. This is defined by the threshold k, that
is we report all substrings s for which D(s, P) ≤ k. For metrics with fixed penalties for
errors, k is called the maximum allowed number of errors. This value may depend both
on the data set and the pattern length, for instance for spell checking a reasonable
number of errors is higher for longer words. It should hold that k < m, since otherwise the
pattern could match any string, and k = 0 corresponds to the exact matching scenario.
See Subsection 2.1.1 for detailed descriptions of the most popular error metrics.


The problem of searching (also called a lookup) can vary depending on the kind of answer 
that is provided. This includes the following operations: 


• Match: determining the membership, i.e. deciding whether P ⊂ T (a decision
problem). When we consider the search complexity, we usually implicitly mean the
match query.

• Count: stating how many times P occurs in T. This refers to the cardinality
of the set containing all indexes i s.t. T[i, i + m − 1] is equal to P. Specific values
of i are ignored in this scenario. The time complexity of the count operation often
depends on the number of occurrences (denoted with occ).

• Locate: reporting all occurrences of P in T, i.e. returning all indexes i s.t.
T[i, i + m − 1] is equal to P.

• Display: showing k characters which are located before and after each match,
that is for all aforementioned indexes i we display substrings T[i − k, i − 1] and
T[i + m, i + m + k − 1]. In the case of approximate matching, it might refer to
showing all text substrings or keywords s s.t. D(s, P) ≤ k.
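
The four query types listed above can be contrasted with a deliberately simple Python sketch for
the exact matching scenario. It is an illustration only (a brute-force scan with hypothetical
function names), and it clips the displayed context at the text boundaries, which is an
assumption not stated in the definitions.

```python
def locate(text, pattern):
    """Return all indexes i s.t. T[i, i + m - 1] is equal to P."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


def match(text, pattern):
    """Decide whether P occurs in T at all."""
    return pattern in text


def count(text, pattern):
    """Return the number of (possibly overlapping) occurrences of P in T."""
    return len(locate(text, pattern))


def display(text, pattern, k):
    """For each occurrence, return up to k characters before and after the match."""
    m = len(pattern)
    return [(text[max(0, i - k):i], text[i + m:i + m + k])
            for i in locate(text, pattern)]
```

For T = abracadabra and P = abra, the locate query returns the positions 0 and 7, count returns
2, and display with k = 2 returns the contexts ("", "ca") and ("ad", "").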


String searching algorithms can be also categorized based on whether the data is
preprocessed. One such classification adapted from Melichar et al. [MHP05, p. 8] is presented
in Table 2.1. Offline searching is also called index-based searching because we preprocess
the text and build a data structure which is called an index. This is opposed to online
searching, where no preprocessing of the input text takes place. For detailed descriptions
of the examples from these classes, consult Chapters 3 and 4 (offline) and Section 2.2
(online).


Text prepr.   Pattern prepr.   Algorithm type   Examples
No            No               Online           Naive, dynamic programming
No            Yes              Online           Pattern automata, rolling hash
Yes           No               Offline          Index-based methods
Yes           Yes              Offline          Signature methods

Table 2.1: Algorithm classification based on whether the data is preprocessed.


2.1.1 Error metrics

The motivation behind error metrics is to minimize the score between the strings which
are somehow related to each other. Character differences that are more likely to occur
should carry a lower penalty depending on the application area, for instance in the case
of DNA certain mutations appear in the real world much more often than others. The
most popular metrics include:


• Hamming distance: relevant for two strings of equal length n, calculates the
number of differing characters at corresponding positions (hence it is sometimes
called the k-mismatches problem). Throughout this thesis we denote the Hamming
distance with Ham, and given that |S1| = |S2| = n, Ham(S1, S2) = |E|, where
E = {i : i ∈ [0, n) ∧ S1[i] ≠ S2[i]}, and Ham(S1, S2) ≤ n. Without preprocessing,
calculating the Hamming distance takes O(n) time. Applications of the Hamming
distance include, i.a., bioinformatics [KCO+01, LSS01], biometrics [DFM98],
cheminformatics [Flo98], circuit design [GLPS97], and web crawling [MJDS07].

• Levenshtein distance: measures the minimum number of edits, here defined as
insertions, deletions, and substitutions. It was first described in the context of
error correction for data transmission [Lev66]. It must hold that Lev(S1, S2) ≤
max(|S1|, |S2|). The calculation using the dynamic programming algorithm takes
O(|S1||S2|) time using Θ(min(|S1|, |S2|)) space (see Subsection 2.2.2). Ukkonen
[Ukk85a] recognized certain properties of the DP matrix and presented an algorithm
with O(k min(|S1|, |S2|)) time for k errors, and an approximation algorithm in a
near-linear time was also described [AKO10]. Levenshtein distance is sometimes
called simply the edit distance. When the distance for approximate matching is
not explicitly specified, we assume the Levenshtein distance.

• Other edit distances. These may allow only a subset of edit actions, e.g., longest
common subsequence (LCS) which is restricted to indels [NW70] or the Episode
distance with deletions [DFG+97]. Another approach is to introduce additional
actions; examples include the Damerau-Levenshtein distance which counts a
transposition as one edit operation [Bar07], a distance which allows for matching one
character with two and vice versa (specifically designed for OCR) [GB08], or a
distance which has weights for substitutions based on the probability that a user
may mistype one character for another [BM00].

• Sequence alignment: there may exist gaps (other characters) in between
substrings of S1 and S2. Moreover, certain characters may match each other even
though they are not strictly equal. The gaps themselves (e.g., their lengths or
positions [NR03, AE86]) as well as the inequality between individual characters
are quantified. The score is calculated using a similarity matrix which is
constructed based on statistical properties of the elements from the domain in question
(e.g., the BLOSUM62 matrix for the sequence alignment of proteins [Edd04]). The
problem can be also formulated as (δ, α)-matching, where the width of the gaps
is at most α and for a set P of positions of corresponding characters, it should
hold that ∀(i1, i2) ∈ P : |S1[i1] − S2[i2]| ≤ δ [CIM+02]. This means that
absolute values of numerical differences between certain characters cannot exceed a
specified threshold. Sequence alignment is a generalization of the edit distance,
and it can be also performed for multiple sequences, although this is known to be
NP-complete [WJ94].

• Regular expression matching: the patterns may contain certain metacharacters
with various meanings. These can specify ranges of characters which can
match at certain positions or use additional constructs such as the wildcard symbol
which matches 0 or more consecutive characters of any type.
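
The two most frequently used metrics from the list above can be sketched in Python as follows.
This is a minimal illustration with hypothetical function names; the Levenshtein routine uses the
standard dynamic programming recurrence and keeps only one row of the matrix, whose length is
determined by the shorter string.

```python
def hamming(s1, s2):
    """Number of positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("the Hamming distance requires strings of equal length")
    return sum(a != b for a, b in zip(s1, s2))


def levenshtein(s1, s2):
    """Minimum number of insertions, deletions, and substitutions."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # keep the shorter string in the inner loop
    prev = list(range(len(s2) + 1))  # distances between "" and prefixes of s2
    for i, a in enumerate(s1, 1):
        cur = [i]
        for j, b in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (a != b)))  # substitution or match
        prev = cur
    return prev[-1]
```

For the strings text and taxi both distances are equal to 2, whereas for kitten and sitting the
Hamming distance is undefined (different lengths) and the Levenshtein distance is equal to 3.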


2.2 Online searching 

In this section, we present selected algorithms for online string searching, and we divide
them into exact and approximate ones. Online algorithms do not preprocess the input
text; however, the pattern may be preprocessed. We assume that the preprocessing time
complexity is equal to O(1), and the time required for pattern preprocessing is subsumed
under search complexity, which means that we consider a scenario where the patterns
are not known beforehand. Search time refers to the match query.

2.2.1 Exact 

Faro and Lecroq [FL13] provided a survey on online algorithms for exact matching and
remarked that over 80 algorithms have been proposed since the 1970s. They categorized
the algorithms into the following three groups:

• Character comparisons

• Automata

• Bit parallelism

The naive algorithm attempts to match every possible substring of T of length m with the
pattern P. This means that it iterates from left to right and checks whether
T[i, i + m − 1] = P for each 0 ≤ i ≤ n − m. Right to left iteration would be also possible
and the algorithm would report the same results. Time complexity is equal to O(nm) in the
worst case (although to O(n) on average, see Appendix B for more information), and there is no
preprocessing or space overhead.
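
The naive algorithm can be sketched in Python as follows (an illustration only; the function
name is hypothetical). The inner loop stops at the first mismatch, which is why the average
number of comparisons per text position is small for typical data.

```python
def naive_search(text, pattern):
    """Report all shifts i for which T[i, i + m - 1] is equal to P."""
    n, m = len(text), len(pattern)
    occurrences = []
    for i in range(n - m + 1):
        j = 0
        # Compare character by character and stop at the first mismatch.
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:
            occurrences.append(i)
    return occurrences
```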

Even without text preprocessing, the performance of the naive algorithm can be improved
significantly by taking advantage of the information provided by the mismatches between
the text and the pattern. Classical solutions for single-pattern matching include the
Knuth-Morris-Pratt (KMP) [KMP77] and the Boyer-Moore (BM) [BM77] algorithm.
The KMP uses information regarding the characters that appear in the pattern in order
to avoid repeated comparisons known from the naive approach. It reduces the time
complexity from O(nm) to O(n + m) in the worst case at the cost of O(m) space. When
a mismatch occurs at position i in the pattern (P[i] ≠ T[s + i]), the algorithm shifts P
by i − l, where l is the length of the longest proper prefix of Ps = P[0, i − 1] which is
also a suffix of Ps (instead of just 1 position), and it starts matching from the position
i = l instead of i = 0. Information regarding Ps is precomputed and stored in a table of
size m. Let us observe that the algorithm does not skip any characters from the input
string. Interestingly, Baeza-Yates and Navarro [BYN04] reported that in practice the
KMP algorithm is roughly two times slower than the brute-force search (although this
depends on the alphabet size).
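
A compact Python sketch of the KMP idea is given below. It follows the common textbook
formulation (a table of border lengths computed for every prefix of the pattern) rather than any
particular implementation from the cited works, and the function names are hypothetical.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of P[0, i] that is also its suffix."""
    m = len(pattern)
    fail = [0] * m
    l = 0
    for i in range(1, m):
        while l > 0 and pattern[i] != pattern[l]:
            l = fail[l - 1]  # fall back to a shorter border
        if pattern[i] == pattern[l]:
            l += 1
        fail[i] = l
    return fail


def kmp_search(text, pattern):
    """Report all occurrences of a non-empty pattern; the text is never re-read."""
    m = len(pattern)
    if m == 0:
        return []
    fail = failure_table(pattern)
    occurrences, j = [], 0  # j = number of pattern characters matched so far
    for pos, c in enumerate(text):
        while j > 0 and c != pattern[j]:
            j = fail[j - 1]  # shift the pattern, keep the matched border
        if c == pattern[j]:
            j += 1
        if j == m:
            occurrences.append(pos - m + 1)
            j = fail[j - 1]  # continue in order to find overlapping occurrences
    return occurrences
```

Each text character is inspected once by the outer loop, which reflects the remark that the
algorithm does not skip any characters from the input.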

The BM algorithm, on the other hand, omits certain characters from the input. It
begins the matching from the end of the pattern and allows for forward jumps based on
mismatches. Thanks to the preprocessing, the size of each shift can be determined in
constant time. One of the two rules for jumping is called a bad character rule, which,
given that P[i] ≠ T[s + i] ∧ T[s + i] = c, aligns T[s + i] with the rightmost occurrence
of c in P (P[j] = c, where j < i), or shifts the pattern by m if c ∉ P. The other rule
is a complex good suffix rule, whose description we omit here, and which is also not a
part of the Boyer-Moore-Horspool (BMH) [Hor80] algorithm which uses only the bad
character rule. This is because the good suffix rule requires extra cost to compute and
it is often not practical. The worst-case time complexity of the BM algorithm is equal to
O(nm + σ), with O(n/min(m, σ) + m + σ) average (the same holds for BMH), and the
average number of comparisons is equal to roughly 3n [Col94]. This can be improved to
achieve a linear time in the worst case by introducing additional rules [Gal79].
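
The simpler BMH variant can be sketched in Python as follows. Note that this is the Horspool
simplification, in which the shift is always determined by the text character aligned with the
last pattern position, and not the full BM algorithm with the good suffix rule; the function name
is hypothetical.

```python
def horspool_search(text, pattern):
    """Report all occurrences of a non-empty pattern using the bad character rule only."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Distance from the rightmost occurrence of each character (the last pattern
    # position is excluded) to the end of the pattern; later entries overwrite earlier ones.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    occurrences, s = [], 0
    while s <= n - m:
        j = m - 1
        while j >= 0 and text[s + j] == pattern[j]:  # compare from right to left
            j -= 1
        if j < 0:
            occurrences.append(s)
        # Characters which are absent from the pattern allow for a jump by m.
        s += shift.get(text[s + m - 1], m)
    return occurrences
```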
One of the algorithms developed later is the Rabin-Karp (RK) [KR87] algorithm. It
starts with calculating the hash value of the pattern in the preprocessing stage, and then
compares this hash with every substring of the text, sliding over it in a similar way to the
naive algorithm. Character-wise verification takes place only if two hashes are equal to
each other. The trick is to use a hash function which can be computed in constant time for
the next substring given its output for the previous substring and the next character, a
so-called rolling hash [Kor06], viz. H(T[s, s + m − 1], T[s + m]) = H(T[s + 1, s + m]). A simple
example would be to simply add the values of all characters, $H(S) = \sum_{i=0}^{n-1} S[i]$. There
exist other functions such as the Rabin fingerprint [Rab81], which treats the characters as
polynomial variables c_i, and the indeterminate x is a fixed base, $R(S) = \sum_{i=1}^{n} c_i x^{i-1}$. The
RK algorithm is suitable for multiple-pattern matching, since we can quickly compare the
hash of the current substring with the hashes of all patterns using any efficient set data
structure. In this way, we obtain the average time complexity of O(n + m) (assuming
that hashing takes linear time); however, it is still equal to O(nm) in the worst case
when the hashes do match and verification is required.
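
A possible Python sketch of the RK algorithm with a polynomial rolling hash is shown below. The
base and the modulus are arbitrary illustrative choices (they are not taken from the cited
works), and the function name is hypothetical.

```python
def rabin_karp_search(text, pattern, base=256, modulus=1_000_000_007):
    """Report all occurrences of a non-empty pattern using a rolling hash."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, modulus)  # weight of the leftmost character in the window
    hp = ht = 0
    for i in range(m):
        hp = (hp * base + ord(pattern[i])) % modulus
        ht = (ht * base + ord(text[i])) % modulus
    occurrences = []
    for s in range(n - m + 1):
        # Verify character-wise only when the hashes agree (collisions are possible).
        if hp == ht and text[s:s + m] == pattern:
            occurrences.append(s)
        if s < n - m:
            # Constant-time update: drop text[s], append text[s + m].
            ht = ((ht - ord(text[s]) * high) * base + ord(text[s + m])) % modulus
    return occurrences
```

For multiple patterns of the same length, the single comparison hp == ht could be replaced with
a lookup in a set of pattern hashes, as suggested in the text.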

Another approach is taken by the Aho-Corasick [AC75] algorithm. It builds a finite state
machine (FSM), i.e. an automaton which has a finite number of states. The structure
of the automaton resembles a trie and it contains edges between certain nodes which
represent the transitions. It is constructed from the queries and attempts to match all
queries at once when sliding over the text. The transitions indicate the next possible
pattern which can be still fully matched after a mismatch at a specified position occurs.
The search complexity is equal to O(n log σ + m + z), which means that it is linear with
respect to the input length n, the length of all patterns (m, for building the automaton),
and the number of occurrences z.
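
The construction can be sketched in Python as follows. This is a simplified illustration which
assumes non-empty patterns and stores the transitions in dictionaries (hence the log σ factor
does not appear here); the function names are hypothetical.

```python
from collections import deque


def build_automaton(patterns):
    """Build the trie (goto), the failure links, and the output lists."""
    goto, fail, out = [{}], [0], [[]]
    for p in patterns:
        state = 0
        for c in p:
            if c not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][c] = len(goto) - 1
            state = goto[state][c]
        out[state].append(p)
    # Breadth-first traversal: a failure link always points to a shallower state.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for c, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and c not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(c, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]  # inherit patterns ending here
    return goto, fail, out


def aho_corasick_search(text, patterns):
    """Return (start_position, pattern) pairs for all occurrences of all patterns."""
    goto, fail, out = build_automaton(patterns)
    state, results = 0, []
    for i, c in enumerate(text):
        while state and c not in goto[state]:
            state = fail[state]  # follow failure links after a mismatch
        state = goto[state].get(c, 0)
        for p in out[state]:
            results.append((i - len(p) + 1, p))
    return results
```

For the patterns he, she, his, and hers and the text ushers, the search reports she at position
1 as well as he and hers at position 2 in a single pass over the text.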

An example of a bit-parallel algorithm is the shift-or algorithm by Baeza-Yates and
Gonnet [BYG92], which aims to speed up the comparisons. The pattern length should
be smaller than the machine word size, which is usually equal to 32 or 64 bits. During the
preprocessing, a mismatch mask M is computed for each character c from the alphabet,
where M[i] = 0 if P[i] = c and M[i] = 1 otherwise. Moreover, we maintain a state
mask R, initially set to all 1s, which holds information about the matches so far. We
proceed in a similar manner to the naive algorithm, trying to match the pattern with
every substring, but instead of character-wise comparisons we use bit operations. At each
step, we shift the state mask to the left and OR it with the M for the current character
T[i]. A match is reported if the most significant bit of R is equal to 0. Provided that
m ≤ w, the time complexity is equal to O(n) and the masks occupy O(σ) space.
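
The shift-or algorithm can be sketched in Python as follows. Since Python integers have arbitrary
precision, the restriction m ≤ w is not enforced here and the state is explicitly truncated to m
bits; in this sketch, the most significant bit means the bit at position m − 1. The function name
is hypothetical.

```python
def shift_or_search(text, pattern):
    """Report all occurrences of a non-empty pattern using bit-parallel state updates."""
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1
    # Mismatch masks: bit i is 0 iff pattern[i] == c, and 1 otherwise.
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, all_ones) & ~(1 << i)
    state = all_ones  # bit i is 0 iff pattern[0..i] matches the text ending here
    occurrences = []
    for pos, c in enumerate(text):
        state = ((state << 1) | masks.get(c, all_ones)) & all_ones
        if (state & (1 << (m - 1))) == 0:
            occurrences.append(pos - m + 1)
    return occurrences
```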

Based on the practical evaluation, Faro and Lecroq [FL13] reported that there is no
superior algorithm and the effectiveness depends heavily on the size of the pattern and
the size of the alphabet. The differences in performance are substantial: algorithms
which are the fastest for short patterns are often among the slowest for long patterns,
and vice versa.


2.2.2 Approximate 


In the following paragraphs, we use δ to denote the complexity of calculating the distance
function between two strings (consult Subsection 2.1.1 for the description of the most
popular metrics). Navarro [Nav01] presented an extensive survey regarding approximate
online matching, where he categorizes the algorithms into four categories which resemble
the ones presented for the exact scenario:


• Dynamic programming 

• Automata

• Bit parallelism 

• Filtering 


The naive algorithm works in a similar manner to the one for exact searching, that is it
compares the pattern with every possible substring of the input text. It forms a generic
idea which can be adapted depending on the edit distance which is used, and for this
reason the time complexity is equal to O(nδ).


The oldest algorithms are based on the principle of dynamic programming (DP). This
means that they divide the problem into subproblems, which are solved and whose
answers are stored in order to avoid recomputing these answers (i.e. it is applicable when
the subproblems overlap) [CLRS09, p. 359]. One of the most well-known examples is the
Needleman-Wunsch (NW) [NW70] algorithm, which was originally designed to compare
biological sequences. Starting from the first character of both strings, it successively
considers all possible actions (insertion, (mis)match, deletion) and constructs a matrix
which holds all alignment scores. It calculates the global alignment, and it can use a
substitution matrix which specifies alignment scores (penalties). The situation where the
scores triplet is equal to (−1, 1, −1) (for gaps, matches, and mismatches, respectively)
corresponds directly to the Levenshtein distance (consult Figure 2.1). The NW method
can be invoked with the input text as one string and the pattern as the other.


A closely related variation of the NW algorithm is the Smith-Waterman (SW) [SW81]
algorithm, which can also identify local (and not just global) alignments by not allowing
negative scores. This means that the alignment does not have to cover the entire length
of the text, and it is therefore more suitable for locating a pattern as a substring. Both
algorithms can be adapted to other distance metrics by manipulating the scoring matrix,
for example by assigning infinite costs in order to prohibit certain operations. The time
complexity of the NW and SW approaches is equal to O(nm) and it is possible to calculate
them using O(min(n, m)) space [Hir75]. Despite their simplicity, both methods are still
popular for sequence alignment because they might be relatively fast in practice and they
report the true answer to the problem, which is crucial when the quality of an alignment
matters.
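
Locating a pattern as a substring with a DP matrix can be illustrated with the following Python
sketch. It is neither the NW nor the SW algorithm with a scoring matrix; instead, it uses the
classic unit-cost edit distance recurrence in which the first row is set to zero (a formulation
usually attributed to Sellers), so that a match may start at any text position. Only one column
of m + 1 values is stored, and the function name is hypothetical.

```python
def approximate_search(text, pattern, k):
    """Return all positions j s.t. some substring of the text ending at j
    is within the Levenshtein distance k from the pattern."""
    m = len(pattern)
    col = list(range(m + 1))  # distances between pattern prefixes and the empty string
    ends = []
    for j, c in enumerate(text):
        prev_diag = col[0]
        col[0] = 0  # an occurrence may start anywhere in the text
        for i in range(1, m + 1):
            old = col[i]
            col[i] = min(col[i] + 1,                         # skip a text character
                         col[i - 1] + 1,                     # skip a pattern character
                         prev_diag + (pattern[i - 1] != c))  # match or substitution
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends
```

The running time is O(nm), which agrees with the bound stated above, and the space is O(m).
With k = 0 the function degenerates to exact matching and reports the end positions of all
occurrences.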
Figure 2.1: Calculating an alignment with Levenshtein distance using the
Needleman-Wunsch (NW) algorithm. We follow the path from the top-left to the bottom-right
corner selecting the highest possible score (underlined); the optimal global alignment
is as follows: text → taxi (no gaps).

Multiple other dynamic programming algorithms were proposed over the years, and they
gradually tightened the theoretical bounds. The difference lies mostly in their flexibility,
that is a possibility of being adapted to other distance metrics, as well as practical
performance. Notable results for the edit distance include the Chang-Lampe [CL92] algorithm
with O(kn/√σ) average time using O(m) space, and the Cole-Hariharan [CH02]
algorithm with the worst-case time O(n + m + k^c n/m) with c = 3 for non-k-break periodic
patterns and c = 4 otherwise, taking O(m + Og) space, where Og refers to occurrences
of certain substrings of the pattern in the text (the analysis is rather lengthy). For the
LCS metric, Grabowski [Gra14] provided the algorithm with O(nm log log n / log² n) time
bound and linear space.

A significant achievement in the automata category is the Wu-Manber-Myers [WMM96]
algorithm that uses the Four Russians technique, which consists in partitioning the
matrix into fixed-size blocks, precomputing the values for each possible block, and then using
a lookup table. It implicitly constructs the automaton, where each state corresponds to
the values in the DP matrix. They obtain an O(kn/log n) expected time bound using
O(n) space. As regards bit parallelism, Myers [Mye99] presented the calculation of the
DP matrix in O(⌈k/w⌉n) average time.

An important category is formed by the filtering algorithms, which try to identify parts
of the input text where it is not possible to match any substrings with the pattern. After
parts of the text are rejected, a non-filtering algorithm is used on the remaining parts.
Numerous filtering algorithms have been proposed, and one of the most significant is the
Chang-Marr [CM94] algorithm with the time bound of O(n(k + log_σ m)/m) for the error
level α when it holds that α < 1 − e/√σ (for very large σ).

As regards the k-mismatches problem, a notable example is the Amir-Lewenstein-Porat
[ALP04] algorithm which can answer the locate query in O(n√(k log k)) time. This
was refined to O(n + n√(k/w) log k) in the word RAM model, where w = Ω(log n) [FG09].
Recently, Clifford et al. [CFP+15] described an algorithm with search time complexity
O(nk² log k/m + n polylog m).


2.3 Offline searching 

An online search is often infeasible for real-world data, since the time required for one
lookup might be measured in the order of seconds. This is caused by the fact that
any online method has to access at least n/m characters from the input text [Gra11,
p. 155], and it normally holds that m ≪ n. This thesis is focused on index-based (offline)
methods, where a data structure (an index, pl. indexes or indices, we opt for the former
term) is built based on the input text in order to speed up further searches, which is a
classic example of data preprocessing. This is justified even if the preprocessing time is
long, since the same text is often queried with multiple patterns. The indexes can be
divided into the two following categories:

• Full-text indexes 

• Keyword (dictionary) indexes 

The former means that we can search for any substring in the input text (string matching,
text matching), whereas the latter operates on individual words (word matching, keyword
matching, dictionary matching, matching in dictionaries). Keyword indexes are usually
appropriate where there exist well-defined boundaries between the keywords (which are
often simply called words), for instance in the case of a natural language dictionary or
individual DNA reads (k-mers). It is worth noting that the number of distinct words is
almost always smaller than the total number of words in a dictionary D, all of which are
taken from a document or a set of documents. Heaps' law states that |D| = O(n^β), where
n is the text size and β is an empirical constant (usually in the interval [0.4, 0.6]) [Hea78].
Full-text and keyword indexes are actually related to each other, because they are often
based on similar concepts (e.g., the pigeonhole principle) and they may even use the
other kind as the underlying data structure.
The indexes can be divided into static and dynamic ones, depending on whether updates
are allowed after the initial construction. Another category is formed by external indexes,
which are optimized with respect to disk I/O and aim to be efficient for the data
which does not fit into the main memory. We can also distinguish compressed indexes
(see Subsection 1.2.5.1 for more information on compression), which store the data in
an encoded form. One goal is to reduce storage requirements while still allowing fast
searches, especially when compared to the scenario where a naive decompression of the
whole index has to be performed. On the other hand, it is also possible to achieve
both space saving and a speedup with respect to the uncompressed index. This can
be achieved mostly due to reduced I/O and (rather surprisingly) fewer comparisons
required for the compressed data [Gra11, p. 131]. Navarro and Mäkinen [NM07] note
that the most successful indexes can nowadays obtain both almost optimal space and
query time. A compressed data structure usually also falls into the category of a succinct
data structure. This is a rather loose term which is commonly applied to algorithms
which employ efficient data representations with respect to space, often close to the
theoretic bound. Thanks to reduced storage requirements, succinct data structures can
process texts which are an order of magnitude bigger than ones suitable for classical
data structures [GP14]. The term succinct may also suggest that we are not required
to decompress the entire structure in order to perform a lookup operation. Moreover,
certain indexes can be classified as self-indexes, which means that they implicitly store
the input string S. In other words, it is possible to transform (decompress) the index
back to S, and thus the index can essentially replace the text.


The main advantage of indexes when compared to the online scenario is fast queries;
however, this naturally comes at a price. Indexes might occupy a substantial amount
of space (sometimes even orders of magnitude more than the input), they are expensive
to construct, and it is often problematic to support functionality such as approximate
matching and updates. Still, Navarro et al. [NBYST01] point out that in spite of the
existence of very fast (both from a practical and a theoretical point of view) online
algorithms, the data size often renders online algorithms infeasible (which is even more
relevant in the year 2015).

Index-based methods are explored in detail in the following chapters: full-text indexes
in Chapter 3 and keyword indexes in Chapter 4. Experimental evaluation of our
contributions can be found in Chapter 5.













Chapter 3 


Full-text Indexes 


Full-text indexes allow for searching for an arbitrary substring from the input text. 
Formally, for a string T of length n, having a set of x substrings S = {s_1, ..., s_x} over a
given alphabet Σ, I(S) is a full-text index supporting matching with a specified distance
D. For any query pattern P, it returns all substrings s from T s.t. D(P, s) ≤ k (with
k = 0 for exact matching). In the following sections, we describe data structures from
this category, divided into exact (Section 3.1) and approximate ones (Section 3.2). Our
contribution in this field is presented in Subsection 3.1.6, which describes a variant of 
the well-known FM-index called FM-bloated. 


3.1 Exact 

3.1.1 Suffix tree 


The suffix tree (ST) was introduced by Weiner in 1973 [Wei73]. It is a trie (see
Subsection 1.2.2.2) which stores all suffixes of the input string, that is n suffixes in total
for a string of length n. Moreover, the suffix tree is compressed, which in this context means
that each node which has only one child is merged with this child, as shown in Figure 3.1.
Searching for a pattern takes O(m) time, since we proceed in a way similar to the search
in a regular trie. Suffix trees offer a lot of additional functionality beyond string searching
such as calculating the Lempel-Ziv compression [Gus97, p. 166] or searching for string
repeats [AKO02]. The ST takes linear space with respect to the total input size (O(n^2)
if uncompressed); however, it occupies significantly more space than the original string:
in a space-efficient implementation around 10.1n bytes on average in practice and
even up to 20n in the worst case [Kur99, AKO02], which might be a bottleneck when
dealing with massive data. Moreover, the space complexity given in bits is actually equal
to O(n log n) (which is also the case for the suffix array) rather than the O(n log σ) required
to store the original text. When it comes to preprocessing, there exist algorithms which
construct the ST in linear time [Ukk95, Far97].

As regards the implementation, an important consideration is how to represent the
children of each node. A straightforward approach such as storing them in a linked list would
degrade the search time, since in order to achieve the overall time of O(m), we have to
be able to locate each child in constant time. This can be accomplished for example with
a hash table which offers an O(1) average time for a lookup.
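
The role of the hash table can be illustrated with a minimal sketch. It is an assumption-laden
simplification: it builds an uncompressed suffix trie (O(n^2) space, no merging of single-child
nodes), and the names SuffixTrie and contains are hypothetical. It only shows why an O(1)
average-time child lookup yields an O(m) search.

```python
class SuffixTrie:
    """Uncompressed suffix trie; each node is a dict mapping a character to a child node."""

    def __init__(self, text):
        self.root = {}
        for start in range(len(text)):
            node = self.root
            for ch in text[start:]:
                # dict lookup/insertion takes O(1) time on average
                node = node.setdefault(ch, {})

    def contains(self, pattern):
        """Return True if pattern is a substring of the indexed text (O(m) average time)."""
        node = self.root
        for ch in pattern:
            node = node.get(ch)
            if node is None:
                return False
        return True


trie = SuffixTrie("banana$")
print(trie.contains("nan"), trie.contains("nab"))
```

Replacing the dict with a linked list of children would turn each step into a scan over up to
σ children, which is the degradation mentioned above.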



Figure 3.1: A suffix tree (ST) which stores all suffixes of the text banana with an
appended terminating character $, which prevents a situation where a suffix could be
a prefix of another suffix.

A common variation is called the generalized suffix tree and it refers to a ST which stores
multiple strings, that is all suffixes for each string S_1, ..., S_x. Additional information
which identifies the string S_i is stored in the nodes, and the complexities are the same
as for a regular ST [Gus97, p. 116]. Compressed suffix trees which reduce the space
requirements were also described (they are usually based on a compressed suffix
array) [Gog11, CNPST^].


3.1.2 Suffix array


The suffix array (SA) comes from Manber and Myers [MM93] and it stores indexes of
sorted suffixes of the input text, see Figure 3.2 for an example. According to
Kärkkäinen [Kar95], suffix arrays perform comparably to suffix trees when it comes to string
matching; however, they are slower for other kinds of searches such as regular expression
matching. Even though the SA takes more space than the original string (4n bytes in
its basic form, and the original string has to be stored as well), it is significantly smaller
than the suffix tree (5n < 10.1n) and it has better locality properties [AKO02]. The
search over a SA takes O(m log n) time, since we perform a binary search over n
suffixes and each comparison takes at most m time (although the comparison is constant
on average, see the appendix). The space complexity is equal to O(n) (there are n
suffixes and we store one index per suffix), and it is possible to construct a SA in linear
time. See Puglisi et al. [PST07] for an extensive survey of multiple construction
algorithms with a practical evaluation, which concludes that the algorithm by Maniscalco
and Puglisi [MP08] is the fastest one. Parallel construction algorithms which use the
GPU were also considered [DK13]. Let us point out a similarity of the SA to the ST, since
sorted suffixes correspond to the depth-first traversal over the ST. The main disadvantage
with respect to the ST is a lack of additional functionality such as that mentioned in the
previous subsection.
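
The O(m log n) search can be sketched as follows. This is an illustration only: the
construction below simply sorts the suffixes (it is not one of the linear-time algorithms cited
above), and the function names build_suffix_array and find_occurrences are hypothetical.

```python
def build_suffix_array(text):
    # Naive construction: sort the starting positions by the suffixes they denote.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted starting positions of pattern, using two binary searches."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


text = "banana$"
sa = build_suffix_array(text)
print(sa)                                  # [6, 5, 3, 1, 0, 4, 2], as in Figure 3.2
print(find_occurrences(text, sa, "ana"))   # [1, 3]
```

Each comparison inspects at most m characters and there are O(log n) comparisons, which
gives the stated bound.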


Suffix     Index
$          6
a$         5
ana$       3
anana$     1
banana$    0
na$        4
nana$      2

Figure 3.2: A suffix array (SA) which stores indexes (0-based) of sorted suffixes of
the text banana$. The suffixes are not stored explicitly, although the entire input text
has to be stored.


3.1.2.1 Modifications 

Multiple modifications of the original SA have been proposed over the years. Their aim
is to either speed up the searches by storing additional information or to reduce space
requirements by compressing the data or omitting a subset of the data. The most notable
examples are presented below.

The enhanced suffix array (ESA) is a variant where additional information in the form
of a longest common prefix (LCP) table is stored [AKO02]. For a suffix array SA over
a string of length n, the LCP table L holds integers from the range [0, n], and it has
the following properties: L[0] = 0, and L[i] holds the length of the longest common
prefix of the suffixes SA[i] and SA[i - 1]. The ESA can essentially replace the suffix tree
since it offers the same functionality, and it can deal with the same problems in the same
time complexity (although constant alphabet size is assumed in the analysis) [AKO04].
For certain applications, it is also required to store the Burrows-Wheeler transform (see
Subsection 3.1.4.1) of the input string and an inverse suffix array (SA^-1[SA[i]] = i).
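
As a small illustration of these definitions, the sketch below computes the LCP table and
the inverse suffix array by direct comparison. It assumes a naively built suffix array and
quadratic-time LCP computation (not the construction used in [AKO02]); the function names
are hypothetical.

```python
def lcp_table(text, sa):
    """L[0] = 0 and L[i] = length of the longest common prefix of suffixes SA[i-1] and SA[i]."""
    def common_prefix(a, b):
        k = 0
        while a + k < len(text) and b + k < len(text) and text[a + k] == text[b + k]:
            k += 1
        return k

    return [0] + [common_prefix(sa[i - 1], sa[i]) for i in range(1, len(sa))]


def inverse_suffix_array(sa):
    """isa[SA[i]] = i, i.e. the row of the suffix which starts at a given text position."""
    isa = [0] * len(sa)
    for row, position in enumerate(sa):
        isa[position] = row
    return isa


text = "banana$"
sa = sorted(range(len(text)), key=lambda i: text[i:])
print(lcp_table(text, sa))          # [0, 0, 1, 3, 0, 0, 2]
print(inverse_suffix_array(sa))     # [4, 3, 6, 2, 5, 1, 0]
```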


The size of the index can always be reduced using any compression method. However,
such a naive approach would certainly have a negative impact on search performance
because of the overhead associated with decompression, and a much better approach is
to use a dedicated solution. Mäkinen [Mak00] presented a compact suffix array (CoSA)
with average search time O(((2n - n')/n')^2 (m + log n)), where n' is the length of the CoSA,
and practical space reduction of up to 50% by replacing repetitive suffixes with links to
other suffixes. Grossi and Vitter [GV05] introduced a compressed suffix array (CSA)
which uses O(n log σ) bits instead of O(n log n) bits. It is based on a transformation of
the SA into an array which points to the position of the next suffix in the text. For
instance, for the text banana$ and the suffix ana$, the next suffix is na$, see Figure 3.3.
These transformed values are compressible because of certain properties such as the fact
that the number of increasing sequences is in O(σ). The search takes O(m / log_σ n + log_σ^ε n)
time, and the relation between search time and space can be fine-tuned using certain
parameters.
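
The transformation underlying the CSA can be sketched as follows. This is only the array of
"next suffix" pointers, without any of the compression described in [GV05]; the wrap-around
for the row of $ (which is made to point at the row of the whole text) is an assumption of
this sketch, and the function names are hypothetical.

```python
def next_suffix_array(text):
    """Return (sa, psi), where psi[i] is the SA row of the suffix starting at SA[i] + 1."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])
    isa = [0] * n
    for row, position in enumerate(sa):
        isa[position] = row
    # (position + 1) % n wraps the last suffix ($) around to the whole text
    return sa, [isa[(position + 1) % n] for position in sa]


def decode(first_column, psi):
    """Recover the text from the sorted characters and psi (the self-index property)."""
    row = psi[0]  # row 0 holds the suffix $, whose successor is the whole text
    out = []
    for _ in range(len(psi)):
        out.append(first_column[row])
        row = psi[row]
    return "".join(out)


sa, psi = next_suffix_array("banana$")
print(psi)                                   # [4, 0, 5, 6, 3, 1, 2]
print(decode(sorted("banana$"), psi))        # banana$
```

For the suffix ana$ (row 2) the pointer is 5, the row of na$, in line with the example above.
Within the rows which share the same first character (e.g. rows 1 to 3 for a), the values
0, 5, 6 form an increasing sequence, which is the property exploited for compression.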


For more information on compressed indexes, including the modifications of the SA, we
refer the reader to the survey by Navarro and Mäkinen [NM07]. The FM-index, which
is presented in Subsection 3.1.4, can also be regarded as a compressed variant of the
SA [FM05].

Figure 3.3: A compressed suffix array (CSA) for the text banana$ which stores indexes
pointing to the next suffix from the text (the SA is shown for clarity and it is not stored
along with the CSA).

The sparse suffix array stores only suffixes which are located at the positions in the
form iq for a fixed q value [KU96]. In order to answer a query, q searches and q - 1
explicit verifications are required, and it must hold that m ≥ q. Another notable
example of a modified suffix array which stores only a subset of data is the sampled
suffix array [CNP+12]. The idea is to select a subset of the alphabet (denoted with
Σ_s) and extract corresponding substrings from the text. The array is constructed only
over those suffixes which start with a symbol from the chosen subalphabet (although
the sorting is performed on full suffixes). Only the part of the pattern which contains a
character c ∈ Σ_s is searched for (i.e. there is one search in total) and the matches are
verified by comparing the rest of the pattern with the text. The disadvantage is that
the following must hold: ∃c ∈ P : c ∈ Σ_s. Practical reduction in space in the order of
50% was reported. Recently, Grabowski and Raniszewski [GR14] proposed an alternative
sampling technique based on minimizers (see Section 4.3.1) which allows for matching
all patterns P s.t. |P| ≥ q, where q is the minimizer window length, and requires only one
search.
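
The query procedure of the sparse suffix array (q searches and q - 1 verifications) can be
sketched as below. The sketch assumes a naive construction and hypothetical function names;
it is meant only to show how an occurrence at an arbitrary position is recovered from suffixes
sampled at multiples of q.

```python
def build_sparse_sa(text, q):
    # Only suffixes which start at positions 0, q, 2q, ... are indexed.
    return sorted(range(0, len(text), q), key=lambda i: text[i:])


def interval(text, sa, pattern):
    """Half-open range of rows in sa whose suffixes start with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return first, lo


def sparse_search(text, ssa, q, pattern):
    assert len(pattern) >= q, "the pattern must be at least q characters long"
    hits = set()
    for r in range(q):  # q searches, one per possible alignment
        first, last = interval(text, ssa, pattern[r:])
        for row in range(first, last):
            start = ssa[row] - r
            # explicit verification of the r skipped characters (none for r = 0)
            if start >= 0 and text[start:ssa[row]] == pattern[:r]:
                hits.add(start)
    return sorted(hits)


text = "abracadabra$"
ssa = build_sparse_sa(text, 2)
print(sparse_search(text, ssa, 2, "abra"))   # [0, 7]
```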


3.1.3 Other suffix-based structures

The suffix tray combines, just as the name suggests, the suffix tree with the suffix
array [GKL06]. The top-level structure is a ST whose nodes are divided into heavy and
light, depending on whether their subtrees have more or fewer leaves than some predefined
threshold. Light children of heavy nodes store their corresponding SA interval. The query
time equals O(m + log σ), and preprocessing and space complexities are equal to O(n).
The authors also described a dynamic variant which is called a suffix trist and allows
updates.

Yet another modification of the classical ST is called the suffix cactus (SC) [Kar95]. Here,
Kärkkäinen reworks the compaction procedure, which is a part of the construction of
the ST. Instead of collapsing only the nodes which have only one child, every internal
node is combined with one of its children. Various methods of selecting such a child exist
(e.g., alphabetical ordering), and thus the SC can take multiple forms for the same input
string. The original article reports the best search times for the DNA, whereas the SC
performed worse than both ST and SA for the English language and random data. The
space complexity is equal to O(n).

3.1.4 FM-index 

The FM-index is a compressed (succinct) full-text index which was introduced by
Ferragina and Manzini [FM00] in the year 2000. It was applied in a variety of
situations, for instance for sequence assembly [LTP+09, SD10] or for ranked document
retrieval [GPS12]. Multiple modifications of the FM-index were described throughout the
years (some are introduced in the following subsections). The strength of the original
FM-index lies in the fact that it occupies less space than the input text while still
allowing fast queries. The search time of its unmodified version is linear with respect to the
pattern length (although a constant-size alphabet is assumed), and the space complexity
is equal to O(H_k(T) + log log n / log n) bits per input symbol. Taking the alphabet size
into account, Grabowski et al. [GNP+06] provide a more accurate total size bound of
O(H_k(T) n + (σ log σ + log log n) n / log n + n^γ σ^(σ+1)) bits for 0 < γ < 1.


3.1.4.1 Burrows-Wheeler Transform


The FM-index is based on the Burrows-Wheeler transform (BWT) [BW94], which is an
ingenious method of transforming a string S in order to reduce its entropy. The BWT permutes
the characters of S in such a way that duplicated characters often appear next to each
other, which allows for easier processing using methods such as run-length or
move-to-front encoding (as is the case in, e.g., the bzip2 compressor [Sew, PJT09]). Most
importantly, this transformation is reversible (as opposed to straightforward sorting),
which means that we can extract the original string from the permuted order. The BWT
could also be used for compression based on the k-th order entropy (described in
Subsection 1.2.5.1), since basic context information can be extracted from the BWT; however,
the loss of speed renders such an approach impractical [Deo05].


In order to calculate the BWT, we first append a special character (we describe it with
$, but in practice it can be any character c ∉ Σ) to S in order to indicate its end. The
character $ is lexicographically smaller than all {c : c ∈ Σ}. The next step is to take all
rotations of S (|S| rotations in total) and sort them in a lexicographic order, thus forming
the BWT matrix, where we denote the first column (sorted characters) with F and the
last column (the result of the BWT, i.e. T^bwt) with L. In order to finish the transform,
we take the last character of each rotation, as demonstrated in Figure 3.4. Let us note
the similarities between the BWT and the suffix array described in Subsection 3.1.2,
since the sorted rotations correspond directly to sorted suffixes (see Figure 3.5). The
calculation takes O(n) time, assuming that the suffixes can be sorted in linear time,
and the space complexity of the naive approach is equal to O(n^2) (but it is linear if
optimized).
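
Both views of the transform can be sketched in a few lines. The first function follows the
naive O(n^2)-space definition with explicit rotations, the second one uses the relation to the
suffix array. Both assume that the input already ends with a unique terminator which is
smaller than every other character (true for $ and lowercase letters under Python's default
string ordering); the function names are hypothetical.

```python
def bwt_via_rotations(s):
    # Build all rotations explicitly, sort them, and read off the last column L.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def bwt_via_suffix_array(s):
    # BWT[i] = S[SA[i] - 1]; for SA[i] = 0 the index -1 denotes the last character of S.
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    return "".join(s[i - 1] for i in sa)


print(bwt_via_rotations("pattern$"))      # nptr$eta, as in Figure 3.4
print(bwt_via_suffix_array("pattern$"))   # nptr$eta
```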


In order to reverse the BWT, we first sort all characters and thus obtain the first column
of the matrix. At this point, we have two columns, namely the first and the last one,
which means that we also have all character 2-grams from the original string S. Sorting
these 2-grams gives us the first and the second column, and we proceed in this manner
(later we sort 3-grams, 4-grams, etc.) until we reach |S|-grams and thus reconstruct the
whole transformation matrix. At this point, S can be found in the row where the last
character is equal to $.
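
The reversal procedure described above translates almost literally into code. The sketch
below is the naive variant (it re-sorts the partial matrix n times and is therefore far from
the efficient LF-mapping-based inversion); the function name is hypothetical.

```python
def inverse_bwt(last_column, terminator="$"):
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        # Prepend the last column to the current rows and sort: after k rounds
        # the table holds the first k columns of the sorted rotation matrix.
        table = sorted(last_column[i] + table[i] for i in range(n))
    # The original string is the rotation which ends with the terminator.
    return next(row for row in table if row.endswith(terminator))


print(inverse_bwt("nptr$eta"))   # pattern$
```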

































R1    $ p a t t e r n
R2    a t t e r n $ p
R3    e r n $ p a t t
R4    n $ p a t t e r
R5    p a t t e r n $
R6    r n $ p a t t e
R7    t e r n $ p a t
R8    t t e r n $ p a

Figure 3.4: Calculating a Burrows-Wheeler transform (BWT) for the string pattern
with an appended terminating character $ (it is required for reversing the transform).
The rotations are already sorted and the result is in the last column, i.e.
BWT(pattern$) = nptr$eta.


3.1.4.2 Operation 

Important aspects of the FM-index are as follows: 


• Count table C, which describes the number of occurrences of lexicographically
smaller characters for all c ∈ Σ (see Figure 3.6).

• Rank operation, which counts the number of set bits in a bit vector v before a
certain position i (we assume that v[i] is included as well), that is rank(i, v) =
|{i' : 0 ≤ i' ≤ i ∧ v[i'] = 1}|.

• Select operation (used only in some variants, e.g., the RLFM [MN05b]), which
reports the position of the i-th set bit in the bit vector v, that is select(i, v) = p if
and only if |{i' : 0 ≤ i' < p ∧ v[i'] = 1}| = i - 1.

Note that both rank and select operations can be generalized to any finite alphabet Σ.
When we perform the search using the FM-index, we iterate the pattern characterwise
in a reverse order while maintaining a current range r = [s, e]. Initially, i = m - 1 and
r = [0, n - 1], that is we start from the last character in the pattern and the range covers
the whole input string (here the input string corresponds to T^bwt, that is a text after the
BWT). At each step we update s and e using the formulae presented in Figure 3.7. The
size of the range after the last iteration gives us the number of occurrences of P in T,
or it turns out that P does not occur in T if s > e at any point. This mechanism is also
known as the LF-mapping.
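
The search procedure can be sketched as follows, using the update formulae from Figure 3.7.
The rank below is a naive linear scan over T^bwt (an assumption made for brevity; efficient
rank structures are discussed in Subsection 3.1.4.3), and the function names are hypothetical.

```python
def bwt(text):
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_table(text):
    """C[c] = number of characters in text which are lexicographically smaller than c."""
    table, total = {}, 0
    for c in sorted(set(text)):
        table[c] = total
        total += text.count(c)
    return table


def rank(t_bwt, i, c):
    # Occurrences of c in t_bwt[0..i], position i included; rank(t_bwt, -1, c) is 0.
    return t_bwt[:i + 1].count(c)


def fm_count(text, pattern):
    """Number of occurrences of pattern in text (text must end with a unique $)."""
    t_bwt, C = bwt(text), count_table(text)
    s, e = 0, len(text) - 1
    for c in reversed(pattern):  # the pattern is processed backwards
        if c not in C:
            return 0
        s = C[c] + rank(t_bwt, s - 1, c)
        e = C[c] + rank(t_bwt, e, c) - 1
        if s > e:
            return 0
    return e - s + 1


print(count_table("mississippi$")["m"])   # 5, as in Figure 3.6
print(fm_count("mississippi$", "ssi"))    # 2
```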


Figure 3.5: A relation between the BWT and the SA for the string pattern with
an appended terminating character $. Let us note that BWT[i] = S[SA[i] - 1] (where
S[-1] corresponds to the last character in S), that is a character at the position i in
the BWT is a character preceding a suffix which is located at the same position in the SA.

Figure 3.6: Count table C which is a part of the FM-index for the text mississippi$.
The entries describe the number of occurrences of lexicographically smaller characters
for all c ∈ Σ. For instance for the letter m, there are 4 occurrences of i and 1 occurrence
of $ in S, hence C[m] = 5. It is worth noting that C is actually a compact representation
of the F column.

s = C[P[i]] + rank(s - 1, P[i])
e = C[P[i]] + rank(e, P[i]) - 1

Figure 3.7: Formulae for updating the range during the search procedure in the
FM-index, where P[i] is the current character and C is the count table. Rank is invoked
on T^bwt and it counts occurrences of the current character P[i].

3.1.4.3 Efficiency

We can see that the performance of C lookup and rank is crucial to the complexity of
the search procedure. In particular, if these operations are constant, the search takes
O(m) time. For the count table, we can simply precompute the values and store them
in an array of size σ with O(1) lookup. As regards rank, a naive implementation which
would iterate the whole array would clearly take O(n) time. On the other hand, if we
were to precompute all values, we would have to store a table of size O(nσ). One of the
popular solutions for an efficient rank uses two structures which are introduced in the
following paragraphs.


The RRR (from the authors' names: Raman, Raman, and Rao) is a data structure which
can answer the rank query in O(1) time for bit vectors (i.e. where Σ = {0, 1}), while
providing compression at the same time [RRR02]. It divides a bit vector v of size n into
n/b blocks each of size b, and groups each s consecutive blocks into one superblock (see
Figure 3.8). For each block, we store a weight w which describes the number of set bits
and an offset o which describes its position in a table (the maximum value of o depends
on w). In this table, for each w and each o, we store a value of rank for each index i, where
0 ≤ i < b (see Figure 3.9). This means that we have to keep C(b, w) entries (where C(b, w)
is the binomial coefficient), each of size b, for each of the b + 1 consecutive weights. Such
a scheme provides compression with respect to storing all n bits explicitly. We achieve the
O(1) query time by storing a rank value for each superblock, and thus during the search
we only iterate at most s blocks (s is constant). The space complexity is equal to
nH_0(v) + O(n log log n / log n) bits.


| 001 | 001 || 101 | 000 |

Figure 3.8: An example of RRR blocks for b = 3 and s = 2, where the first superblock
is equal to 001001 and the second superblock is equal to 101000.

Figure 3.9: An example of an RRR table for w = 2 and b = 3, where C(3, 2) = 3 (i.e. the
number of all block values of length 3 with weight 2 is equal to 3), with rank presented
for successive indexes i ∈ [0, 2]. Block values do not have to be stored explicitly.
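
The block and superblock layout can be illustrated with the following sketch. It keeps a
(weight, offset) pair per block and a cumulative rank per superblock, but it stores them in
plain Python lists, so it demonstrates only the query logic and not the space savings of
[RRR02]. The class name RRRSketch and its attributes are hypothetical.

```python
from itertools import accumulate, combinations


class RRRSketch:
    def __init__(self, bits, b=3, s=2):
        self.b, self.s = b, s
        bits = list(bits) + [0] * (-len(bits) % b)  # pad the last block with zeros
        # For each weight w: all b-bit blocks with w set bits (sorted) and, for each
        # of them, the rank at every in-block index (the shared lookup table).
        blocks_of, self.prefix_ranks = {}, {}
        for w in range(b + 1):
            blocks = sorted(tuple(int(i in ones) for i in range(b))
                            for ones in combinations(range(b), w))
            blocks_of[w] = blocks
            self.prefix_ranks[w] = [tuple(accumulate(block)) for block in blocks]
        self.weights, self.offsets, self.super_rank = [], [], []
        total = 0
        for k in range(0, len(bits), b):
            if (k // b) % s == 0:
                self.super_rank.append(total)  # set bits before this superblock
            block = tuple(bits[k:k + b])
            w = sum(block)
            self.weights.append(w)
            self.offsets.append(blocks_of[w].index(block))
            total += w

    def rank(self, i):
        """Number of set bits in v[0..i], position i included."""
        block = i // self.b
        superblock = block // self.s
        result = self.super_rank[superblock]
        for k in range(superblock * self.s, block):  # at most s - 1 whole blocks
            result += self.weights[k]
        w, o = self.weights[block], self.offsets[block]
        return result + self.prefix_ranks[w][o][i % self.b]


v = RRRSketch([0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0], b=3, s=2)  # the vector from Figure 3.8
print([v.rank(i) for i in range(12)])
```

Since s and b are constants, the loop over the blocks of one superblock and the table lookup
both take constant time.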


The wavelet tree (WT) from Grossi et al. [GGV03] is a balanced tree data structure that
stores a hierarchy of bit vectors instead of the original string, which allows the use of
RRR (or any other bit vector with an efficient rank operation). Starting from the root,
we recursively partition the alphabet into two subsets of equal size (if the number
of distinct characters is even), until we reach single symbols which are stored as leaves.
Characters belonging to the first subset are indicated with 0s, and characters belonging
to the second subset are indicated with 1s (consult Figure 3.10 for an example).

Thanks to the WT, we can implement a rank query for any fixed size alphabet in O(log σ)
time (assuming that a binary rank is calculated in constant time), since the height of
the tree is equal to log σ. For a given character c, we query the WT at each node and
proceed left or right depending on the subset to which c belongs. Each subsequent rank
is called in the form rank(c, p), where p is the result of the rank at the previous level.
Ferragina et al. [FMMN07] described generalized WTs, for instance a multiary WT with
O(log σ / log log n) traversal time (consult Bowe's thesis [Bow10] for more information
and a practical evaluation).
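
The rank traversal can be sketched as follows. The binary rank at each node is a naive scan
here (in practice it would be delegated to a structure such as RRR), the way of splitting the
alphabet is one possible choice which need not match Figure 3.10, and the class name is
hypothetical.

```python
class WaveletTree:
    def __init__(self, text, alphabet=None):
        self.alphabet = sorted(set(text)) if alphabet is None else alphabet
        self.bits = None  # None marks a leaf which represents a single symbol
        if len(self.alphabet) <= 1:
            return
        half = (len(self.alphabet) + 1) // 2
        self.left_set = set(self.alphabet[:half])
        # 0 = character from the first subset, 1 = character from the second subset
        self.bits = [0 if c in self.left_set else 1 for c in text]
        self.left = WaveletTree([c for c in text if c in self.left_set],
                                self.alphabet[:half])
        self.right = WaveletTree([c for c in text if c not in self.left_set],
                                 self.alphabet[half:])

    def rank(self, c, i):
        """Occurrences of c in text[0..i], position i included."""
        if i < 0 or c not in self.alphabet:
            return 0
        if self.bits is None:
            return i + 1  # a leaf holds only copies of one character
        bit = 0 if c in self.left_set else 1
        p = sum(1 for x in self.bits[:i + 1] if x == bit)  # binary rank (naive here)
        child = self.left if bit == 0 else self.right
        return child.rank(c, p - 1)  # continue with the position mapped to the child


wt = WaveletTree("abracadabra")
print(wt.rank("a", 10), wt.rank("b", 10), wt.rank("r", 3))   # 5 2 1
```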


Figure 3.10: A wavelet tree (WT) over the string abracadabra. The alphabet is
divided into two subsets at each level, with 0 corresponding to one subset and 1 to the
other.

3.1.5 FM-index flavors

Multiple flavors of the FM-index were proposed over the years with the goal of decreasing
the query time, e.g., having O(m) time without the dependence on n, or reducing the
occupied space. The structures which provide asymptotically optimal bounds are often
not practical due to the very large constants which are involved [Vig08]. For this reason,
many authors focus on practical performance, and these structures are usually based
on a fast rank operation and take advantage of compressed representations of bit
vectors [NM07]. The following paragraphs present selected examples; consult Navarro and
Mäkinen [NM07] for an extensive survey of compressed full-text indexes which discusses
the whole family of FM-indexes.

One of the notable examples where the query time does not depend on σ is the
alphabet-independent FM-index by Grabowski et al. [GMN04]. The idea is to first compress the
text using Huffman coding and then apply the BWT to it, obtaining a bit
vector v. This vector is then used for searching in a manner corresponding to the
FM-index: the array C stores the number of zeros up to a certain position, and the relation
C[c] + rank(T^bwt, i, c) is replaced with i - rank(v, i) if c = 0 and n' - rank(v, n') +
rank(v, i) if c = 1, where n' is the length of the text compressed with Huffman. The
space complexity is equal to O(n(H_0(T) + 1)) bits, and the average search time is equal
to O(m(H_0(T) + 1)) under “reasonable assumptions”.

On the practical front, Grabowski et al. [GRD15] recently described a cache-aligned rank
with 1 cache miss. Moreover, they proposed the so-called FM-dummy index with several
variants, for instance one which stores a separate bit vector for each alphabet symbol
(σ vectors in total). Other variants include using certain dense codes as well as using
Huffman-shaped multiary wavelet trees with different arity values. A Huffman-shaped
wavelet tree is unbalanced, and the paths for frequent characters are shorter (which
translates to a smaller number of rank queries on bit vectors). Moreover, the operations
(which are performed in the same manner as for the regular wavelet tree) are faster on
average [CNO14]. They reported search times which are 2-3 times faster than those for
other state-of-the-art methods at the cost of using 1.5-5 times more space.

Data structures which concentrate on reducing space requirements rather than the query
time include the compressed bit vectors from Kärkkäinen et al. [KKP14], where different
compression methods are used for blocks depending on the type of the block, for instance
run-length encoding for blocks with a small number of runs. Another notable example is
a data-aware FM-index by Huo et al. [HCZ+15], which encodes the bit vectors (resulting
from the WT) using Gamma coding (a kind of variable-length coding), and thus obtains
one of the best compression ratios in practice.

3.1.5.1 Binary rank 

As described in the previous subsection, in order to achieve good overall performance, it is
sufficient to design a data structure which supports an efficient rank query for bit vectors
thanks to the use of a wavelet tree (RRR being a notable example). Jacobson [Jac89]
originally showed that it is possible to obtain a constant-time rank operation using o(n)
extra bits for |v| = n (the same holds for select [Cla96]). Vigna [Vig08] proposed to
interleave (i.e. store next to one another) blocks and superblocks (concepts which were
introduced for the RRR structure) for uncompressed bit vectors in order to reduce the
number of cache and translation lookaside buffer (TLB) misses from 3 to 2. This was
extended by Gog and Petri [GP14], who showed better practical performance by using
a slightly different layout with 64-bit counters. González and Navarro [GN09] provided
a discussion of the dynamic scenario where insertions and deletions to the vector are
allowed, and they obtain a space bound of nH_0(v) + o(n log σ) bits and O(log n (1 +
log σ / log log n)) time for all operations (i.e. queries and updates).

3.1.6 FM-bloated 

One of the crucial issues when it comes to the performance of the FM-index is the
number of CPU cache misses which occur during the search. This comes from the fact
that in order to calculate the LF-mapping, non-local access to the BWT sequence is
often required (in the order of Ω(m) misses during the search for a pattern of length m),
even for a small alphabet. The problem of cache misses during the FM-index backward
search was identified as the main performance limiter by Chacón et al. [CMEH13], who
proposed to perform the LF-mapping with several symbols at a time (in practice, at most
4 for the 4-symbol DNA alphabet for which the scheme was described). This solution
allowed, for example, to improve the search speed by a factor of 1.5 for the price of
occupying roughly 2 times the size of the 1-step FM-index. Here, we address the problem
of cache misses during the pattern search (count query) in a way related to the Chacón
et al. solution: we also work on q-grams, yet the algorithmic details are different. The
following two subsections describe two variants of our approach, and experimental results
can be found in Section 5.1.


3.1.6.1 Superlinear space 


FM-bloated is a variation of the FM-index which aims to speed up the queries at the cost
of additional space. We start by calculating the BWT for the input string in the same
way as for the regular FM-index; however, the difference is that we operate on q-grams
rather than on individual characters, and the count table stores results for each q-gram
sampled from the BWT matrix. This is the case for all q, where q is a power of 2,
up to some predefined value q_max (for instance 128). Namely, for each suffix T[i, n - 1],
we take all q-grams in the following form: T[i - 1] (1-gram), T[i - 2, i - 1] (2-gram),
T[i - 4, i - 1] (4-gram), etc. The q-grams are extracted until we reach q_max or one
of the q-grams contains the terminating character (such a q-gram with the terminating
character is discarded), consult Figure 3.11 for an example. Let Q denote a collection of
all q-grams for all i. For each distinct item s from Q, we create a list L_s of its occurrences
in the sorted suffix order (simply called SA order). This resembles an inverted index on
q-grams, yet the main difference is that the elements in the lists are arranged in SA rather
than the text order, e.g., for the q-gram t in Figure 3.11, the 1-based list of occurrences
corresponding to rows would be as follows: {3, 7}.

Figure 3.11: Q-gram extraction in the FM-bloated structure with superlinear space
for the text T = pattern$. All q-grams are extracted, i.e. q_max ≥ ⌈n/2⌉.

For a given pattern P, we start the LF-mapping with its longest suffix P_s s.t. |P_s| ≤
q_max ∧ |P_s| = 2^c for some c ∈ Z. The following backward steps deal with the remaining
prefix of P in a similar way. Note that the number of LF-mapping steps is equal to the
number of 1s in the binary representation of m, i.e. it is in the order of O(log m), and if m
is a power of two, then the result for match and count queries can be reported in constant
time (we simply return |L_s|). When q_max is bigger, the overall index size is bigger,
but the search is faster (for patterns of sufficient length) because it allows for farther
jumps towards the beginning of the pattern. In our representation, each LF-mapping
step translates to performing two predecessor queries on a list L_s. A naive solution is
a binary search with O(log n) worst-case time (or even a linear search, which may be
faster if the list is short), yet the predecessor query can also be handled in O(log log n)
time using a y-fast trie [Wil83]. Hence, the overall average search complexity is equal
to O(m + log m log log n), with O(m / C_L + log m log log n) cache misses, where C_L is the
cache line size in bytes (provided that each symbol from the pattern occupies one byte).
As regards the space complexity, there is a total of n log n q-gram occurrences (log n
q-gram positions for each of the n rows of the BWT matrix). Hence, the total length of all
occurrence lists is equal to n log n, and the total complexity is equal to O(n log^2 n) bits
(since we need log n bits to store one position from the BWT matrix of n rows).
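
The count query of the superlinear variant can be sketched as below. This is a simplified
illustration under several assumptions, not the C++ implementation evaluated in this work:
the suffix array is built naively, Python dictionaries and lists replace the hash table with
implicitly stored q-grams, a binary search (bisect) plays the role of the predecessor query
instead of a y-fast trie, and the count of a q-gram is taken to be the first row whose suffix
starts with that q-gram. All names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def build_fm_bloated(text, q_max=4):
    """text must end with a unique terminator ($); q-grams have lengths 1, 2, 4, ..., q_max."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])
    counts, lists = {}, {}
    for row, i in enumerate(sa):  # rows are visited in SA order, so the lists stay sorted
        q = 1
        while q <= q_max:
            if i - q >= 0:        # q-gram which precedes the suffix of this row
                lists.setdefault(text[i - q:i], []).append(row)
            if i + q <= n - 1:    # q-gram which starts this row (without the terminator)
                counts.setdefault(text[i:i + q], row)
            q *= 2
    return {"n": n, "q_max": q_max, "counts": counts, "lists": lists}


def fm_bloated_count(index, pattern):
    s, e = 0, index["n"] - 1
    end = len(pattern)
    while end > 0:
        q = 1
        while q * 2 <= min(end, index["q_max"]):
            q *= 2                # the longest power of 2 which still fits
        gram = pattern[end - q:end]
        rows = index["lists"].get(gram)
        if rows is None:
            return 0
        base = index["counts"][gram]
        # two predecessor-style queries on the occurrence list of the q-gram
        s, e = base + bisect_left(rows, s), base + bisect_right(rows, e) - 1
        if s > e:
            return 0
        end -= q
    return e - s + 1


index = build_fm_bloated("pattern$")
print([row + 1 for row in index["lists"]["t"]])   # [3, 7], the 1-based rows given above
print(fm_bloated_count(index, "ter"))             # 1
```

For a pattern of length 3 such as ter, the search performs two steps (the 2-gram er followed
by the 1-gram t), in line with the number of 1s in the binary representation of m.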

As regards the implementation (in the C++ language), our focus is on data compaction.
Each q-gram acts as a key in a hash table where collisions are resolved with chaining,
and the q-grams are stored implicitly, i.e. as a (pointer, length) pair, where the pointer
refers to the original string. The values in the hash table include the count and the
list of occurrences which are stored in one contiguous array. We use a binary search
for calculating rank on lists whose length is greater than or equal to 16 (an empirically
determined value) and a linear search otherwise.


3.1.6.2 Linear space 


In this variant, instead of extracting all 1-, 2-, etc. q-grams for each row of the BWT
matrix, we extract only selected q-grams with the help of minimizers (consult
Subsection 4.3.1 for the description of minimizers). The first step is to calculate all
(α, q)-minimizers for the input text T, with some fixed α and q parameters and lexicographic
ordering, where ties are resolved in favor of the leftmost of the smallest substrings. Next,
we store both the count table and the occurrence lists for all single characters in the same
way as for the regular FM-index (using, e.g., a wavelet tree). Moreover, we store
information about the counts and occurrences of all q-grams which are located in between
the minimizers from the set M(T); these q-grams are referred to as phrases. For the
set of minimizer indexes I(T), consecutive phrases p_i are constructed in the following
manner: p_i = T[I[i], I[i + 1] - 1], consult Figure 3.12. It is worth noting that this
approach resembles the recently proposed SamSAMi index, a sampled suffix array on
minimizers [GR14].


T              appearance
M              ap, ar, an
I              0, 4, 6
Phrases        appe, ar
Phrase ranges  [0, 3], [4, 5]


Figure 3.12: Constructing FM-bloated phrases for the text appearance with the use
of (4, 2)-minimizers.
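
The construction from Figure 3.12 can be reproduced with the following Python sketch. It assumes that an (a, q)-minimizer is the lexicographically smallest q-gram among every a consecutive q-grams, with ties resolved in favor of the leftmost one; the formal definition is given in Subsection 4.3.1 and may differ in detail. The function names are hypothetical.

```python
def minimizer_positions(text, a, q):
    """Start positions of the minimizers, one per window of `a` consecutive q-grams."""
    qgrams = [text[i:i + q] for i in range(len(text) - q + 1)]
    positions = set()
    for start in range(len(qgrams) - a + 1):
        window = range(start, start + a)
        # min() returns the first minimal element, so ties go to the leftmost q-gram.
        positions.add(min(window, key=lambda i: qgrams[i]))
    return sorted(positions)


def phrases(text, positions):
    """Phrases p_i = T[I[i], I[i+1] - 1] between consecutive minimizer positions."""
    return [text[start:end] for start, end in zip(positions, positions[1:])]


positions = minimizer_positions("appearance", 4, 2)
print(positions)                          # [0, 4, 6]
print(phrases("appearance", positions))   # ['appe', 'ar']
```

The same two functions can be applied to a pattern, which is what guarantees that phrases are selected from P in the same way as from T.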


The search proceeds as follows: 

1. We calculate all minimizers for the pattern. 

2. We search for the pattern suffix $P_s = P[s_r, m - 1]$, where $s_r$ is the starting position
of the rightmost minimizer, using the regular FM-index mechanism, i.e. processing
1 character at a time.

3. We operate on the phrases between the minimizers rather than individual characters,
and the search for these q-grams is performed in the same way as for the
superlinear variant. If it turns out that the phrase is a 1-gram, a faster FM-index
mechanism for single characters can be used.

4. We search for the pattern prefix $P_p = P[0, s_l - 1]$, where $s_l$ is the starting position
of the leftmost minimizer, using the regular FM-index mechanism, i.e. processing 1
character at a time.

The use of minimizers ensures that the phrases are selected from P in the same way 
as they are selected from T during the index construction. The overall average search 
complexity is equal to O(m log log n) (again, assuming that a y-fast trie [Wil83] is used),
and the space complexity is linear. 


3.2 Approximate 


Navarro et al. [NBYST01] provided an extensive survey of full-text indexes for approximate
string matching. They categorized the algorithms into three categories based on
the search procedure:

• Neighborhood generation: all strings in $\{S : S \in \Sigma^* \wedge D(S, P) \le k\}$ for a
given pattern P are searched for directly.

• Partitioning into exact searching (PiES): substrings of the pattern are
searched for in an exact manner and these matches are extended into approximate
matches.

• Intermediate partitioning: substrings of the pattern are searched for approximately
but with fewer errors. This method lies in between the two
other ones.


In the neighborhood generation approach, we generate the k-neighborhood K of the
pattern, which contains all strings which could be possible matches over a specified
alphabet Σ (if the alphabet is finite, the number of such strings is finite as well). These
strings can be searched for using any exact index such as a suffix tree or a suffix array. The
main issue is the fact that the size of K grows exponentially, $|K| = O(m^k \sigma^k)$ [Ukk85b],
which means that basically all factors (and especially k) should be small. When the
suffix tree is used as an index for the input text, Cobbs [Cob95] proposed a solution
which reduces the number of nodes that have to be processed. It runs in $O(mq + |o|)$
time and occupies O(q) space, where q ≤ n (q depends on the problem instance) and |o|
is the size of the output.
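
To give a feeling for how quickly the neighborhood grows, the following Python sketch generates it explicitly. As a simplifying assumption it handles only the Hamming distance (substitutions); the edit-distance neighborhood discussed above is larger still. The function name is hypothetical.

```python
from itertools import combinations, product


def hamming_neighborhood(pattern, k, alphabet):
    """All strings over `alphabet` within Hamming distance k of `pattern`."""
    result = {pattern}
    for errors in range(1, k + 1):
        # Choose which positions are altered, then every way of filling them.
        for positions in combinations(range(len(pattern)), errors):
            for letters in product(alphabet, repeat=errors):
                candidate = list(pattern)
                for position, letter in zip(positions, letters):
                    candidate[position] = letter
                result.add("".join(candidate))
    return result


print(len(hamming_neighborhood("abc", 1, "abc")))   # 7
print(len(hamming_neighborhood("abc", 2, "abc")))   # 19
```

Each generated string would then be looked up in an exact index, so the query cost is proportional to the neighborhood size.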


When the pattern is partitioned and searched for exactly (PiES), we again have to store
an index which can answer these exact queries. Let us note that this approach is based
on the pigeonhole principle. In the context of approximate string searching this means
that for a given k, at least one of k + 1 parts of average length |P|/(k + 1) must match
the text exactly (more generally, s parts match if k + s parts are created). The value
of k should not be too large, otherwise it could be the case that a substantial part
of the input text has to be verified (especially if the pattern is short). Alternatively,
the pattern can be divided into m − q + 1 overlapping q-grams and these q-grams are
searched for (using the locate query) against the index of q-grams extracted from the
text (see Figure 3.13 for an example of q-gram extraction). The q-grams which are
stored by the index are situated at fixed positions with an interval h, and it must hold
that $h \le \lfloor (m - k - q + 1)/(k + s) \rfloor$ for occurrences of P in T to contain s samples. Sutinen
and Tarhio [ST95] suggested that the optimal value for q is in the order of $O(\log_\sigma m)$.
If it turns out that the positions of subsequent q-grams may correspond to a match,
explicit verification is performed. Similarly to the k-neighborhood scenario, any index
can be used in order to answer the exact queries. Let us note that this approach with
pattern substring lookup and verification can also be used for exact searching.

In the case of intermediate partitioning, we split the pattern into s pieces, and we search
for these pieces in an approximate manner using neighborhood generation. The case
of s = 1 corresponds to pure neighborhood generation, whereas the case of s = k + 1
is almost like PiES. In general, this method requires more searching but less verification
when compared to PiES, and thus lies in between the two approaches which were
previously described.
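
The pigeonhole-based partitioning into exact searching can be sketched in Python as follows. The sketch assumes the Hamming distance and a pattern longer than k characters, and it uses `str.find` as a stand-in for an exact index; all function names are hypothetical.

```python
def split_pattern(pattern, k):
    """Split `pattern` into k + 1 disjoint pieces, returned as (offset, piece) pairs."""
    m, parts = len(pattern), k + 1
    bounds = [i * m // parts for i in range(parts + 1)]
    return [(bounds[i], pattern[bounds[i]:bounds[i + 1]]) for i in range(parts)]


def hamming(a, b):
    """Hamming distance between two strings of equal length."""
    return sum(x != y for x, y in zip(a, b))


def search_k_mismatches(text, pattern, k):
    """Start positions where `pattern` occurs in `text` with at most k mismatches."""
    m, matches = len(pattern), set()
    for offset, piece in split_pattern(pattern, k):
        hit = text.find(piece)              # stand-in for an exact index lookup
        while hit != -1:
            start = hit - offset            # candidate alignment of the whole pattern
            if 0 <= start <= len(text) - m and hamming(text[start:start + m], pattern) <= k:
                matches.add(start)          # explicit verification succeeded
            hit = text.find(piece, hit + 1)
    return sorted(matches)


print(search_k_mismatches("abcdefgh xbcdefgh", "abcdefgh", 1))   # [0, 9]
```

The second occurrence is found through the piece efgh, which matches exactly even though the first piece contains the mismatch.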


Consult Maaß and Nowak [MN05a] in order to see a detailed comparison of the complexities
of modern text indexing methods for approximate matching. Notable structures
from the theoretical point of view include the k-errata trie by Cole et al. [CGL04], which
is based on the suffix tree and the LCP structure (see Subsection 3.1.2.1 for a description
of the LCP). It can be used in various contexts, including full-text and keyword indexing
as well as wildcard matching. For full-text indexing and the k-mismatches problem, it
uses $O(n \log^k n / k!)$ space and offers $O(m + \log^k n / k! + occ)$ query time. This was extended
by Tsur [Tsu10], who described a structure similar to the one from Cole et al. with
time complexity O(m + log log n + occ) (for constant k) and $O(n^{1+\varepsilon})$ space for a constant
ε > 0. As regards a solution which is dedicated for the Hamming distance, Gabriele et
al. [GMRS03] provided an index with average search time O(m + occ) and $O(n \log^l n)$
space (for some l).


Let us note that these full-text indexes can usually be easily adapted to the keyword
matching scenario which is described in the following chapter.


3.3 Word-based 


An interesting category of data structures are word-based indexes, which can be used
for approximate matching and especially sequence alignment. They employ heuristic
approaches in order to speed up the searching, and for this reason they are not guaranteed
to find the optimal match. This means that they are also approximate in the
mathematical sense, i.e. they do not return the true answer to the problem. Their popularity
is especially widespread in the context of bioinformatics, where the massive sizes of
the databases often force the programmers to use efficient filtering techniques. Notable
examples include the BLAST [AGM+90] and FASTA [LP85] tools.


3.3.1 BLAST 


BLAST stands for Basic Local Alignment Search Tool, and it was published by Altschul
et al. [AGM+90] in 1990 with the purpose of comparing biological sequences (see Subsection 1.1.2
for more information about biological data). The name may refer to the
algorithm or to the whole suite of string searching tools for bioinformatics which are based
on the said algorithm. BLAST relies heavily on various heuristics and for this reason it
is highly domain-specific. In fact, there exist various flavors of BLAST for different data
sets, for instance one for protein data (blastp) and one for DNA (blastn). Another
notable modification is PSI-BLAST, which is combined with dynamic programming
in order to identify distant protein relationships.

The basic algorithm proceeds as follows: 


1. Certain regions are removed from the pattern. These include repeated substrings
and regions of low complexity (measured statistically using, e.g., DUST for DNA
[MGSA06]).

2. We create a set Q containing q-grams with overlaps (that is all available q-grams,
see Figure 3.13) which are extracted from the pattern. Each s ∈ Q is scored against
all possible q-grams (these can be precomputed) and ones with the highest scores
are retained, creating a candidate set $Q_c$.


3. Each word from $Q_c$ is searched for in an exact manner against the database (using
for instance an inverted index, see Subsection 4.1.2). These exact matches create
the seeds which are later used for extending the matches.


4. The seeds are extended to the left and to the right as long as the alignment score 
is increasing. 


5. Alignment significance is assessed using domain-specific statistical tools. 


Size (q)   q-grams
2          te, ex, xt, ti, in, ng
3          tex, ext, xti, tin, ing
4          text, exti, xtin, ting
5          texti, extin, xting


Figure 3.13: Selecting all overlapping q-grams (with the shift of 1) from the text
T = texting. It must always hold that q ≤ |T|.
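
The extraction shown in Figure 3.13 amounts to sliding a window of size q over the text with a shift of 1. A minimal Python sketch (the function name is hypothetical):

```python
def overlapping_qgrams(text, q):
    """All overlapping q-grams of `text`, extracted with a shift of 1."""
    if not 1 <= q <= len(text):
        raise ValueError("q must satisfy 1 <= q <= |T|")
    return [text[i:i + q] for i in range(len(text) - q + 1)]


print(overlapping_qgrams("texting", 2))   # ['te', 'ex', 'xt', 'ti', 'in', 'ng']
print(overlapping_qgrams("texting", 5))   # ['texti', 'extin', 'xting']
```

A text of length n yields n − q + 1 such q-grams, which agrees with the m − q + 1 count used for patterns.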


In general, BLAST is faster than other alignment algorithms such as the SW algorithm
(see Subsection 2.2.2) due to its heuristic approach. However, this comes at a price of reduced
accuracy, and Shpaer et al. [SRY+96] state that there is a substantial chance that
BLAST will miss a distant sequence similarity. Moreover, hardware-oriented implementations
of the SW have been created, and in certain cases they can match the performance
of BLAST [MV08]. Still, BLAST is currently the most common tool for sequence alignment
using massive biological data, and it is openly available via its website [BLA], which
means that it can be conveniently run without consuming local resources.


















Chapter 4 


Keyword Indexes 


Keyword indexes operate on individual words rather than the whole input string. Formally,
for a collection $\mathcal{D} = \{d_1, \ldots, d_x\}$ of x strings (words, q-grams) of total length n
over a given alphabet Σ, $I(\mathcal{D})$ is a keyword index supporting matching with a specified
distance D. For any query pattern P, it returns all words d from $\mathcal{D}$ s.t. D(P, d) ≤ k
(with k = 0 for exact matching). Approximate dictionary matching was introduced by
Minsky and Papert in 1969 [MP69, CGL04].


In the following sections, we describe algorithms from this category, divided into exact
(Section 4.1) and approximate ones (Section 4.2). Our contribution in this field is presented
in Subsection 4.2.3, which describes an index for approximate matching with few
mismatches (especially 1 mismatch).


4.1 Exact 


If the goal were to support only the match query for a finite number of keywords, we
could use any efficient set data structure such as a hash table or a trie (see Subsections
1.2.3 and 1.2.2.2) in order to store all those keywords. Boytsov [Boy11] reported
that depending on the data set, either one of these two may be faster. In order to reduce
space requirements, we could use minimal perfect hashing (see Subsection 1.2.3), and we
could also compress the entries in the buckets.


4.1.1 Bloom filter 

Alternatively, we could provide only approximate answers (in a mathematical sense) in
order to occupy even less space. A relevant data structure is the Bloom filter (BF) [Blo70],
which is a space-efficient, probabilistic data structure with possible false positive matches,
but no false negatives and an adjustable error rate. The BF uses a bit vector A of size n,
where no bits are initially set. Each element e is hashed with k different hash functions $H_i$
in the form $H(e) = i$, where $i \in \mathbb{Z} \wedge 0 \le i < n$, and we set $A[H_i(e)] = 1$ for all $1 \le i \le k$.
When the lookup is performed, the queried element is hashed with the same functions and it is checked
whether A[i] = 1 for all the resulting indexes i, and if that is the case a possible match is reported (consult
Figure 4.1). Broder and Mitzenmacher [BM04] provided the following formula for the
expected false positive rate: $F_P = 0.5^k \approx 0.6185^{m/n}$, where, in this formula, m is the size of the filter in
bits and n is the number of elements. They note that for example when m = 8n, the false
positive probability is slightly above 0.02. Recently, Fan et al. [FAKM14] described a
structure based on cuckoo hashing which takes even less space than the BF and supports
deletions (unlike the BF).
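
A minimal Python sketch of the Bloom filter is given below. As an assumption made only for illustration, the k hash functions are simulated by salting SHA-256 with the function number; any family of independent hash functions could be used instead. The class and method names are hypothetical.

```python
import hashlib


class BloomFilter:
    def __init__(self, size, num_hashes):
        self.size = size                 # number of bits in the vector A
        self.num_hashes = num_hashes     # number of hash functions k
        self.bits = [0] * size

    def _indexes(self, element):
        # Simulate k hash functions by prefixing the element with the function number.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{element}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, element):
        for index in self._indexes(element):
            self.bits[index] = 1

    def might_contain(self, element):
        # True may be a false positive; False is always correct.
        return all(self.bits[index] for index in self._indexes(element))


bloom = BloomFilter(size=18, num_hashes=3)
for element in ("x", "y", "z"):
    bloom.add(element)
print(all(bloom.might_contain(e) for e in ("x", "y", "z")))   # True

# False positive rate for m = 8n according to the quoted approximation.
print(0.6185 ** 8)
```

The last line evaluates to a value slightly above 0.02, in agreement with the figure quoted from Broder and Mitzenmacher.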

{x, y, z}

A = 0 1 0 1 1 1 0 0 0 0 0 1 0 1 0 0 1 0


Figure 4.1: A Bloom filter (BF) for approximate membership queries with n = 18
and k = 3, holding the elements from the set {x, y, z}. The element w is not in the set
since $H_2(w) = 15$ and A[15] = 0; reproduced from Wikimedia Commons.¹


4.1.2 Inverted index 


An inverted index is a keyword index which contains a mapping from words $d \in \mathcal{D}$ to the
lists which store all positions $p_i$ of their occurrences in the text ($d \rightarrow p_1, \ldots, p_n$). These
positions can be for instance indexes in a string of characters, or if a more coarse-grained
approach were sufficient, they could identify individual documents or databases. See
Figure 4.2 for an example with a single input string. The positions allow a search on the
whole phrase (i.e. multiple words) by searching for each word separately and checking
whether the positions describe consecutive words in the text (that is by looking for list
intersections with a shift). It could also be used for searching for a query which may cross
the boundaries of the words by searching for substrings of a pattern and comparing the
respective positions (consult Section 4.3 for more information). This means that the goal
of an inverted index is to support various kinds of queries (e.g., locate, see Section 2.1)
efficiently.

¹David Eppstein, available at http://en.wikipedia.org/wiki/Bloom_filter#/media/File:Bloom_filter.svg,
in public domain.

Word     Occurrence list
This     0
is       12
a        15
banana   5, 17


Figure 4.2: An inverted index which stores a mapping from words to their positions 
(0-based) in the text “This banana is a banana”. 
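
The index from Figure 4.2, together with the phrase search by shifted list intersection, can be sketched in Python as follows. The sketch assumes that words are separated by single spaces and uses a dictionary of lists; the function names are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(text):
    """Map each word to the sorted list of its 0-based character positions."""
    index, position = defaultdict(list), 0
    for word in text.split(" "):
        if word:
            index[word].append(position)
        position += len(word) + 1        # skip the word and the following space
    return index


def phrase_positions(index, phrase):
    """Positions where the words of `phrase` occur consecutively in the text."""
    words = phrase.split(" ")
    result, shift = set(index.get(words[0], [])), 0
    for previous, word in zip(words, words[1:]):
        shift += len(previous) + 1
        # Intersect with the list of the next word shifted back to the phrase start.
        result &= {p - shift for p in index.get(word, [])}
    return sorted(result)


index = build_inverted_index("This banana is a banana")
print(dict(index))   # {'This': [0], 'banana': [5, 17], 'is': [12], 'a': [15]}
print(phrase_positions(index, "a banana"))   # [15]
```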

The main advantage of the inverted index is fast single-word queries, which can be answered
in constant average time using for example a hash table (an inverted index is a rather
generic idea, which means that it could also be implemented with other data structures
such as binary trees). On the other hand, there is a substantial space overhead (in the
order of O(n)) and the original string has to be stored as well. For this reason, one of the
key challenges for inverted indexes is how to succinctly represent the lists of positions
while still allowing fast access. Multiple methods were proposed, and they are often
combined with one another [AM05].

The most popular one is to store gaps, that is differences between subsequent positions.
For the index in Figure 4.2, the list for banana would be equal to {5, 12} instead of {5, 17}
(17 − 5 = 12). The values of the gaps are usually smaller than the original positions and
for this reason they can be stored using fewer bits. Another popular approach
is to use byte-aligned coding. Here, each byte contains a one-bit flag which is set if the
number is greater than or equal to $2^7$ (that is when it does not fit into 7 bits), and the other
seven bits are used for the data. If the number does not fit, the 7 least-significant bits are
stored in the original byte, and the algorithm tries to store the remaining bits in the
next byte, proceeding until the whole number has been exhausted. In order to reduce
the average length (in bits) of the occurrence list, one could also divide the original
text into multiple blocks of fixed size. Instead of storing exact positions, only block
indexes are stored, and after the index is retrieved the word is searched for explicitly
within the block [MW94b]. If the size of the data is so massive that it is infeasible to
construct a single index (as is often the case for web search engines), sometimes only
the most relevant data is selected for being stored in the index (thus forming a pruned
index) [NC07].
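
Gap encoding and byte-aligned coding can be combined as in the following Python sketch. It assumes that the flag is the most significant bit of each byte and that the least-significant 7-bit group is stored first, as described above; the function names are hypothetical.

```python
def to_gaps(positions):
    """Replace each position (except the first) with the difference from its predecessor."""
    return [current - previous for current, previous in zip(positions, [0] + positions[:-1])]


def encode_varbyte(number):
    """Byte-aligned code: 7 data bits per byte, flag bit set when more bytes follow."""
    encoded = bytearray()
    while number >= 2 ** 7:
        encoded.append(0x80 | (number & 0x7F))   # flag set: the number continues
        number >>= 7
    encoded.append(number)                       # last byte, flag cleared
    return bytes(encoded)


def decode_varbyte(data):
    """Decode a concatenation of byte-aligned codes back into a list of numbers."""
    numbers, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            numbers.append(current)
            current, shift = 0, 0
    return numbers


gaps = to_gaps([5, 17])
print(gaps)                                      # [5, 12]
stream = b"".join(encode_varbyte(gap) for gap in gaps + [300])
print(len(stream), decode_varbyte(stream))       # 4 [5, 12, 300]
```

Small gaps occupy a single byte each, whereas the value 300 needs two bytes, which illustrates why gaps are preferred over absolute positions.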

4.2 Approximate 

Boytsov [Boy11] presented an extensive survey of keyword indexes for approximate
searching (including a practical evaluation). He divided the algorithms into the following
two categories:

• Direct methods: like neighborhood generation (see Section 3.2), where certain
candidates are searched for exactly.


• Sequence-based filtering methods: the dictionary is divided into many (disjoint
or overlapping) clusters. During the search, a query is assigned to one or
several clusters containing candidate strings and thus an explicit verification is
performed only on a fraction of the original dictionary.


Notable results from the theoretical point of view include the k-errata trie by Cole
et al. [CGL04], which was already mentioned in the previous chapter. For the Hamming
distance and dictionary matching, it uses $O(n + d \log^k d / k!)$ space and offers
$O(m + (\log^k d / k!) \log\log n + occ)$ query time, where $d = |\mathcal{D}|$ (this also holds for the edit distance
but with larger constants). Another theoretical work, describing an algorithm which is
similar to our split index (which we describe in Subsection 4.2.3), was given by Shi and
Widmayer [SW96], who obtained O(n) preprocessing time and space complexity and
O(n) expected time if k is bounded by O(m/log m). They introduced the notion of
home strings for a given q-gram, which is the set of strings in $\mathcal{D}$ that contain the q-gram
in the exact form (the value of q is set to |P|/(k + 1)). In the search phase, they partition
P into k + 1 disjoint q-grams and use a candidate inspection order to speed up finding
the matches with up to k edit distance errors.


On the practical front, Bocek et al. [BHSH07] provided a generalization of the Mor-Fraenkel
(MF) [MF82] algorithm for k ≥ 1 which is called FastSS. To check if two strings
$S_1$ and $S_2$ match with up to k errors, we first delete all possible ordered subsets of k'
symbols for all 0 ≤ k' ≤ k from $S_1$ and $S_2$. Then we conclude that $S_1$ and $S_2$ may be
in edit distance at most k if and only if the intersection of the resulting lists of strings is
non-empty (explicit verification is still required). For instance, if $S_1$ = abbac and k = 2,
then its neighborhood is as follows: abbac, bbac, abac, abac, abbc, abba, abb, aba, abc,
aba, abc, aac, bba, bbc, bac and bac (of course, some of the resulting strings are repeated
and they may be removed). If $S_2$ = baxcy, then its respective neighborhood for k = 2 will
contain, e.g., the string bac, but the following verification will show that $S_1$ and $S_2$ are
in edit distance greater than 2. If, however, $Lev(S_1, S_2) \le 2$, then it is impossible not to
have in the neighborhood of $S_2$ at least one string from the neighborhood of $S_1$, hence we
will never miss a match. The lookup requires $O(k m^k \log(n m^k))$ time (where m is the average
word length from the dictionary) and the index occupies $O(n m^k)$ space. Another practical
filter was presented by Karch et al. [KLS10] and it improved on the FastSS method.
They reduced space requirements and query time by splitting long words (similarly to
FastBlockSS, which is a variant of the original method) and storing the neighborhood
implicitly with indexes and pointers to original dictionary entries. They claimed to be
faster than other approaches such as the aforementioned FastSS and the BK-tree [BK73].
Recently, Chegrane and Belazzougui [CB14] described another practical index and they
reported better results when compared to Karch et al. Their structure is based on the
dictionary by Belazzougui [Bel09] for the edit distance of 1 (see the following subsection).
An approximate (in the mathematical sense) data structure for approximate matching
which is based on the Bloom filter (see Subsection 4.1.1) was also described [MW94a].
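
The deletion neighborhood used by FastSS, together with the example of abbac and baxcy, can be checked with the following Python sketch. It is an illustration of the filtering idea only, not the FastSS implementation, and the function names are hypothetical.

```python
from itertools import combinations


def deletion_neighborhood(word, k):
    """All distinct strings obtained by deleting at most k symbols from `word`."""
    neighborhood = set()
    for deletions in range(min(k, len(word)) + 1):
        for kept in combinations(range(len(word)), len(word) - deletions):
            neighborhood.add("".join(word[i] for i in kept))
    return neighborhood


def levenshtein(a, b):
    """Edit distance computed with the standard dynamic programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        current = [i]
        for j, y in enumerate(b, 1):
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + (x != y)))
        previous = current
    return previous[-1]


def may_match(s1, s2, k):
    """Filter: a non-empty intersection means that verification is required."""
    return bool(deletion_neighborhood(s1, k) & deletion_neighborhood(s2, k))


print(len(deletion_neighborhood("abbac", 2)))    # 12 distinct strings
print(may_match("abbac", "baxcy", 2))            # True (both contain bac)
print(levenshtein("abbac", "baxcy") <= 2)        # False: the candidate is rejected
```

The 16 strings listed above collapse to 12 after removing duplicates, and the candidate pair passes the filter but fails the explicit verification.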


4.2.1 The 1-error problem 

It is important to consider methods for detecting a single error, since over 80% of errors
(even up to roughly 95%) are within k = 1 for the edit distance with transpositions
[Dam64, PZ84]. Belazzougui and Venturini [BV12] presented a compressed index
whose space is bounded in terms of the k-th order empirical entropy of the indexed dictionary.
It can be based either on perfect hashing, having O(m + occ) query time, or
on a compressed permuterm index with $O(m \min(m, \log_\sigma n \log\log n) + occ)$ time (when
$\sigma = \log^c n$ for some constant c) but improved space requirements. The former is a
compressed variant of a dictionary presented by Belazzougui [Bel09] which is based on
neighborhood generation and occupies O(n log σ) space and can answer queries in O(m)
time. Chung et al. [CTW14] showed a theoretical work where external memory is used,
and their focus is on I/O operations. They limited the number of these operations to
O(1 + m/wB + occ/B), where w is the size of the machine word and B is the number of
words within a block (a basic unit of I/O), and their structure occupies O(n/B) blocks.
In the category of filters, Mor and Fraenkel [MF82] described a method which is based
on the deletion-only 1-neighborhood.

For the 1-mismatch problem, Yao and Yao [YY95] described a data structure for binary
strings of fixed length m with $O(m \log\log |\mathcal{D}|)$ query time and $O(|\mathcal{D}| m \log m)$ space
requirements. This was later improved by Brodal and Gąsieniec [BG96] with a data
structure with O(m) query time which occupies O(n) space. This was improved with a
structure with O(1) query time and $O(|\mathcal{D}| \log m)$ space in the cell probe model (where
only memory accesses are counted) [BV00]. Another notable example is a recent theoretical
work of Chan and Lewenstein [CL15], who introduced the index with the optimal
query time, i.e. O(m/w + occ), which uses additional $O(w d \log^{1+\varepsilon} d)$ bits of space (beyond
the dictionary itself), assuming a constant-size alphabet.


4.2.2 Permuterm index 

A permuterm index is a keyword index which supports queries with one wildcard symbol
[Gar76]. The idea is to store all rotations of a given word appended with the terminating
character, for instance for the word text, the index would consist of the following permuterm
vocabulary: text$, ext$t, xt$te, t$tex, $text. When it comes to searching,
the query is first rotated so that the wildcard appears at the end, and subsequently
its prefix is searched for using the index. This could be for example a trie, or any other
data structure which supports a prefix lookup.
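
The rotation of words and queries can be sketched in Python as follows. For brevity, the sketch assumes a single `*` wildcard and replaces the trie with a linear scan over a dictionary of rotations; a real index would use a structure with an efficient prefix lookup. The function names are hypothetical.

```python
def permuterm_rotations(word, terminator="$"):
    """All rotations of `word` appended with the terminating character."""
    extended = word + terminator
    return [extended[i:] + extended[:i] for i in range(len(extended))]


def build_permuterm(words):
    """Map every rotation to its source word (a stand-in for a trie)."""
    return {rotation: word for word in words for rotation in permuterm_rotations(word)}


def wildcard_query(index, query, terminator="$"):
    """Answer a query with exactly one '*' wildcard via a prefix lookup on the rotations."""
    before, after = query.split("*")
    prefix = after + terminator + before     # rotate so that the wildcard is at the end
    return sorted({word for rotation, word in index.items() if rotation.startswith(prefix)})


print(permuterm_rotations("text"))   # ['text$', 'ext$t', 'xt$te', 't$tex', '$text']
index = build_permuterm(["text", "test", "next"])
print(wildcard_query(index, "te*t"))   # ['test', 'text']
print(wildcard_query(index, "*ext"))   # ['next', 'text']
```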


The main problem with the standard permuterm index is its space usage, as the number
of strings inserted into the data structure is the number of words multiplied by
the average string length. Ferragina and Venturini [FV10] proposed a compressed permuterm
index in order to overcome the limitations of the original structure with respect
to space. They explored the relation between the permuterm index and the BWT (see
Subsection 3.1.4.1), which is applied to the concatenation of all strings from the input
dictionary, and they provided a modification of the LF-mapping known from FM-indexes
in order to support the functionality of the permuterm index.


4.2.3 Split index 

One of the practical approximate indexes was described by Cisłak (thesis author) and
Grabowski [CG15]. Experimental results for this structure can be found in Section 5.2.

As indexes supporting approximate matching tend to grow exponentially in k, the maximum
number of allowed errors, it is also a worthwhile goal to design efficient indexes
supporting only a small k. For this reason, we focus on the problem of dictionary matching
with few mismatches, especially one mismatch, where $Ham(d_i, P) \le k$ for a collection
of words $d \in \mathcal{D}$, a pattern P, and the Hamming distance Ham. The algorithm that we
are going to present is uncomplicated and based on the Dirichlet (pigeonhole) principle, ubiquitous
in approximate string matching techniques. We partition each word d into k + 1 disjoint
pieces $p_1, \ldots, p_{k+1}$ of average length |d|/(k + 1) (hence the name “split index”), and
each such piece acts as a key in a hash table HT. The size of each piece $p_i$ of word d
is determined using the following formula: $|p_i| = \lceil |d|/(k+1) \rceil$ (or $|p_i| = \lfloor |d|/(k+1) \rfloor$,
depending on the practical evaluation) and $|p_{k+1}| = |d| - \sum_{i=1}^{k} |p_i|$, i.e. the piece size is
rounded to the nearest integer and the last piece covers the characters which are not in
other pieces. This means that the pieces might be in fact unequal in length, e.g., 3 and
2 for |d| = 5 ∧ k = 1. The values in HT are the lists of words which have one of their
pieces as the corresponding key. In this way, every word occurs on exactly k + 1 lists.
This seemingly bloats the space usage; still, in the case of small k the occupied space is
acceptable. Moreover, instead of storing full words on the respective lists, we only store
their “missing” prefix or suffix. For instance, for the word table and k = 1, we would
have a relation tab → le on one list (i.e. tab would be the key and le would be the
value) and le → tab on the other.

In the case of k = 1, we first populate each list with the pieces without the prefix and
then with the pieces without the suffix; additionally we store the position on the list (as
a 16-bit index) where the latter part begins. In this way, we traverse only a half of a
list on average during the search. We can also support k larger than 1; in this case,
we ignore the piece order on a list, and we store $\lceil \log_2(k + 1) \rceil$ bits with each piece that
indicate which part of the word is the list key. Let us note that this approach would also
work for k = 1, however, it turned out to be less efficient.


As regards the implementation (in the C++ language), our focus is on data compactness.
In the hash table, we store the buckets which contain word pieces as keys (e.g., le) and
pointers to the lists which store the missing pieces of the word (e.g., tab, ft). These
pointers are always located right next to the keys, which means that unless we are
very unlucky, a specific pointer should already be present in the CPU cache during the
traversal. The memory layouts of these substructures are fully contiguous. Successive
strings are represented by multiple characters with a prepended 8-bit counter which
specifies the length, and the counter with the value 0 indicates the end of the list.
During the traversal, each length can be compared with the length of the piece of the
pattern. As mentioned before, the words are partitioned into pieces of fixed length. This
means that on average we calculate the Hamming distance for only half of the pieces
on the list, since the rest can be ignored based on their length. Any hash function for
strings can be used, and two important considerations are the speed and the number of
collisions, since a high number of collisions results in longer buckets, which may in turn
have a negative effect on the query time (this subject is explored in more detail along
with the results in Chapter 5). Figure 4.3 illustrates the layout of the split index.


The preprocessing stage proceeds as follows: 


1. Duplicate words are removed from the dictionary $\mathcal{D}$.


The following steps refer to each word d from $\mathcal{D}$:


2. The word d is split into k + 1 pieces.

3. For each piece $p_i$: if $p_i \notin HT$, we create a new list $l_n$ containing the missing pieces
(later simply referred to as a missing piece; in the case of k = 1, this is always one
contiguous piece) $\mathcal{P} = \{p_j : j \in [1, k + 1] \wedge j \ne i\}$ and add it to the hash table (we
append $p_i$ and the pointer to $l_n$ to the bucket). Otherwise, if $p_i \in HT$, we append
the missing pieces $\mathcal{P}$ to the already existing list $l_i$.

Figure 4.3: Split index for keyword indexing which shows the insertion of the word
table for k = 1. The index also stores the words left and tablet (only selected lists
containing pieces of these two words are shown), and L1 and L2 indicate pointers to
the respective lists. The first cell of each list indicates a 1-based word position (i.e. the
word count from the left) where the missing prefixes begin (k = 1, hence we deal with
two parts, namely prefixes and suffixes), and 0 means that the list has only missing
suffixes. Adapted from Wikimedia Commons.²


As regards the search: 


1. The pattern P is split into k + 1 pieces. 

2. We search for each piece $p_i$ (the prefix and the suffix if k = 1): the list $l_i$ is retrieved
from the hash table or we continue if $p_i \notin HT$. We traverse each missing piece $p_j$
from $l_i$. If $|p_j| = |P| - |p_i|$, the verification is performed and the result is returned
if $Ham(p_j, P - p_i) \le k$.

3. The pieces are combined into one word in order to form the answer. 
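
The preprocessing and the search for k = 1 can be sketched in Python as follows. The sketch uses a standard dictionary of lists and tags each missing piece with its side, which is an assumption made for readability; it does not reproduce the compact memory layout of the C++ implementation. The function names are hypothetical.

```python
from collections import defaultdict


def split_word(word):
    """Split a word into a prefix of ceil(|d| / 2) characters and the remaining suffix."""
    middle = (len(word) + 1) // 2
    return word[:middle], word[middle:]


def hamming(a, b):
    """Hamming distance between two strings of equal length."""
    return sum(x != y for x, y in zip(a, b))


def build_split_index(words):
    """Hash table from a piece to the missing pieces, tagged with the side they complete."""
    table = defaultdict(list)
    for word in set(words):                           # duplicates are removed first
        prefix, suffix = split_word(word)
        table[prefix].append(("suffix", suffix))      # key: prefix, missing: suffix
        table[suffix].append(("prefix", prefix))      # key: suffix, missing: prefix
    return table


def search_one_mismatch(table, pattern):
    """All indexed words within Hamming distance 1 of `pattern`."""
    prefix, suffix = split_word(pattern)
    found = set()
    for side, piece in table.get(prefix, []):
        # The length test discards pieces of words whose length differs from |P|.
        if side == "suffix" and len(piece) == len(suffix) and hamming(piece, suffix) <= 1:
            found.add(prefix + piece)
    for side, piece in table.get(suffix, []):
        if side == "prefix" and len(piece) == len(prefix) and hamming(piece, prefix) <= 1:
            found.add(piece + suffix)
    return sorted(found)


table = build_split_index(["table", "left", "tablet", "cable"])
print(search_one_mismatch(table, "table"))   # ['cable', 'table']
print(search_one_mismatch(table, "tible"))   # ['table']
```

For the query tible, the prefix tib is not a key, but the suffix le is, and the missing prefix tab is within one mismatch of tib.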

²Jorge Stolfi, available at http://en.wikipedia.org/wiki/File:Hash_table_3_1_1_0_1_0_0_SP.svg,
CC A-SA 3.0.

4.2.3.1 Complexity

Let us consider the average word length |d|, where $|d| = (\sum_{i=1}^{|\mathcal{D}|} |d_i|)/|\mathcal{D}|$. The average
time complexity of the preprocessing stage is equal to O(kn), where k is the allowed
number of errors, and n is the total input dictionary size (i.e. the length of the concatenation
of all words from $\mathcal{D}$, $n = \sum_{i=1}^{|\mathcal{D}|} |d_i|$). This is because for each word and for each
piece $p_i$ we either add the missing pieces to a new list or append them to the already
existing one in O(|d|) time (let us note that $|\mathcal{D}| \cdot |d| = n$). We assume that adding a new
element to the bucket takes constant time on average, and the calculation of all hashes
takes O(n) time in total. This is true irrespective of which list layout is used (there are
two layouts for k = 1 and k > 1, see the preceding paragraphs). The occupied space is
equal to O(kn), because each piece appears on exactly k lists and in exactly 1 bucket.

The average search complexity is equal to O(kt), where t is the average length of the
list. We search for each of k + 1 pieces of the pattern of length m, and when the
list corresponding to the piece p is found, it is traversed and at most t verifications
are performed. Each verification takes at most $O(\min(m, |d_{max}|))$ time, where $d_{max}$ is
the longest word in the dictionary (or O(k) time, in theory, using the old technique
from Landau and Vishkin [LV89], after O(n log σ)-time preprocessing), but O(1) time
on average. Again, we assume that determining a location of the specific list, that is
iterating a bucket, takes O(1) time on average. As regards the list, its average length t is
higher when there is a higher probability that two words $d_1$ and $d_2$ from $\mathcal{D}$ have two parts
of the same length l which match exactly, i.e. $\Pr(d_1[s_1, s_1 + l - 1] = d_2[s_2, s_2 + l - 1])$.
Since all words are sampled from the same alphabet Σ, t depends on the alphabet size,
that is t = f(σ). Still, the dependence is rather indirect; in real-world dictionaries which
store words from a given language, t will be rather dependent on the k-th order entropy
of the language.


4.2.3.2 Compression 

In order to reduce storage requirements, we apply a basic compression technique. We
find the most frequent q-grams in the word collection and replace their occurrences on
the lists with unused symbols, e.g., byte values 128, ..., 255. The values of q can be
specified at the preprocessing stage, for instance q = 2 and q = 4 are reasonable for
the English alphabet and DNA, respectively. Different q values can also be combined
depending on the distribution of q-grams in the input text, i.e. we may try all possible
combinations of q-grams up to a certain q value and select ones which provide the best
compression. In such a case, longer q-grams should be encoded before shorter ones. For
example, the word compression could be encoded as #p*s\ using the following substitution
list: com → #, re → *, co → $, om → &, sion → \ (note that not all q-grams from the
substitution list are used). Possibly even a recursive approach could be applied, although
this would certainly have a substantial impact on the query time. See Section 5.2 for the
experimental results and a further discussion.
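
The substitution from the example above can be reproduced with the following Python sketch. It assumes a greedy left-to-right encoding which tries longer q-grams before shorter ones at each position, and it assumes that the substitute symbols do not occur in the original words; the thesis implementation may select and encode q-grams differently.

```python
SUBSTITUTIONS = {"com": "#", "re": "*", "co": "$", "om": "&", "sion": "\\"}


def compress(word, substitutions):
    """Greedy left-to-right encoding that tries longer q-grams before shorter ones."""
    by_length = sorted(substitutions, key=len, reverse=True)
    encoded, i = [], 0
    while i < len(word):
        for qgram in by_length:
            if word.startswith(qgram, i):
                encoded.append(substitutions[qgram])
                i += len(qgram)
                break
        else:
            encoded.append(word[i])      # no q-gram starts here, copy the character
            i += 1
    return "".join(encoded)


def decompress(encoded, substitutions):
    """Expand every substitute symbol back into its q-gram."""
    reverse = {symbol: qgram for qgram, symbol in substitutions.items()}
    return "".join(reverse.get(symbol, symbol) for symbol in encoded)


encoded = compress("compression", SUBSTITUTIONS)
print(encoded)                               # #p*s\
print(decompress(encoded, SUBSTITUTIONS))    # compression
```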


The space usage could be further reduced by the use of a different character encoding.
For the DNA (assuming 4 symbols only) it would be sufficient to use 2 bits per character,
and for the basic English alphabet 5 bits. In the latter case there are 26 letters, which in
a simplified text can be augmented only with a space character, a few punctuation marks,
and a capital letter flag. Such an approach would also be beneficial for space compaction,
and it could have a further positive impact on cache usage. The compression naturally
reduces the space while increasing the search time, and a sort of a middle ground can
be achieved by deciding which additional information to store in the index. This can
be for instance the length of an encoded (compressed) piece after decoding, which could
eliminate some pieces based on their size without performing the decompression and
explicit verification.
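
As an illustration of the 2-bit encoding for the DNA, the following Python sketch packs four
symbols into one byte. The symbol-to-code assignment and the function names are assumptions
made for this example only; the original length has to be stored separately, which matches the
idea of keeping the decoded length next to each piece.

```python
# Hypothetical 2-bit codes for the four DNA symbols.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
LETTERS = "ACGT"


def pack_dna(sequence):
    """Pack a DNA string into bytes, 2 bits per symbol (4 symbols per byte)."""
    packed = bytearray((len(sequence) + 3) // 4)
    for i, char in enumerate(sequence):
        packed[i // 4] |= CODES[char] << (2 * (i % 4))
    return bytes(packed)


def unpack_dna(packed, length):
    """Recover the first `length` symbols from the packed representation."""
    return "".join(
        LETTERS[(packed[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)
    )
```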


4.2.3.3 Parallelization 

The algorithm could be sped up by means of parallelization, since index access during the
search procedure is read-only. In the most straightforward approach, we could simply
distribute individual queries between multiple threads. A more fine-grained variation
would be to concurrently operate on word pieces after the word has been split up (with
the number of pieces being dependent on the k parameter). We could even access in
parallel the lists which contain missing pieces (prefixes and suffixes for k = 1), although
the gain would probably be limited since these lists usually store at most a few words.
If we had a sufficient number of threads at our disposal, these approaches could be
combined. Still, it is to be noted that the use of multiple threads has a negative effect
on cache utilization.


4.2.3.4 Inverted split index 


The split index could be extended in order to include the functionality of an inverted
index for approximate matching. As mentioned in Subsection 4.1.2, the inverted index
could be in practice any data structure which supports an efficient word lookup. Let us
consider the compact list layout of the split index presented in Figure 4.3, where each
piece is located right next to other pieces. Instead of storing only the 8-bit counter which
specifies the length of the piece, we could also store (right next to this piece) its position
in the text. Such an approach would increase the average length of the list only by a
constant factor and it would not break the contiguity of the lists, while also keeping the
$O(kn)$ space complexity. Moreover, the position should already be present in the CPU
cache during the list traversal.


4.3 Keyword selection 


Keyword indexes can also be used in the scenario where there are no explicit boundaries
between the words. In such a case, we would like to select the keywords according to
a well-defined set of rules and form a dictionary $D$ from the input text $T$. Such an
index which stores q-grams sampled from the input text may be referred to as a q-gram
index. It is useful for answering keyword rather than full-text queries, which might be
required for example due to time requirements (i.e. when we would like to trade space
for speed). Examples of the input which cannot be easily divided into words include
some natural languages (e.g., Chinese) where it is not possible to clearly distinguish the
words (their boundaries depend on the context) or other kinds of data such as a complete
genome [KWLL05].


Let us consider the input text $T$ which is divided into $n - q + 1$ q-grams. The issue
lies in the amount of space which is occupied by all q-gram tuples, where $s$ is
the q-gram and $l_i$ identifies its positions, which is in the order of $O(n)$, or $O(n q_{max})$
for all possible q-grams up to some $q_{max}$ value. General compression techniques are
usually not sufficient and thus a dedicated solution is required. This is especially the
case in the context of bioinformatics where data sets are substantial; the applications
could be for instance retrieving the seeds in the seed-and-extend algorithm described in
Section 3.3. One of the approaches was proposed by Kim et al. [KWLL05], and it aims to
eliminate the redundancy in position information. Consecutive q-grams are grouped into
subsequences, and each q-gram is identified by the position of the subsequence within the
documents and the position of the q-gram within the subsequence, which forms a two-level
index structure. This concept was also extended by the original authors to include the
functionality of approximate matching [KWL07].
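
To make the space issue concrete, a plain (uncompressed) q-gram index can be sketched in Python
as follows. This is only the naive baseline in which every one of the $n - q + 1$ positions is
stored explicitly; it is not the two-level structure of Kim et al., and the function name is
hypothetical.

```python
from collections import defaultdict


def build_qgram_index(text, q):
    """Map each q-gram of the text to the sorted list of its start positions."""
    index = defaultdict(list)
    for position in range(len(text) - q + 1):
        index[text[position:position + q]].append(position)
    return dict(index)
```

For a text of length $n$ the lists hold $n - q + 1$ positions in total, regardless of how many
distinct q-grams there are, which is the redundancy that dedicated solutions try to reduce.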


4.3.1 Minimizers 


The idea of minimizers was introduced by Roberts et al. [RHH+04] (with applications
in, e.g., genome sequencing with de Bruijn graphs [CLJ+14] and k-mer counting
[DKGDG15]), and it consists in storing only selected rather than all q-grams from
the input text. Here, the goal is to choose such q-grams from a given string $S$ (a set
$M(S)$), so that for two strings $S_1$ and $S_2$, if it holds for a pattern $P$ that
$P \subset S_1 \land P \subset S_2$ and $|P|$ is above some threshold, then it should also hold
that $|M(S_1) \cap M(S_2)| \geq 1$. In order to find an $(a, q)$-minimizer, we slide a window of
length $q + a - 1$ ($a$ consecutive q-grams) over $T$, shifting it by 1 character at a time, and
at each window position we select a q-gram which is the smallest one lexicographically (ties
may be resolved for instance in favor of the leftmost of the smallest substrings). Figure 4.4
demonstrates this process.


T = texting

Window   Content   Selected 2-gram
W1       text      ex
W2       exti      ex
W3       xtin      in
W4       ting      in

Figure 4.4: Selecting (3, 2)-minimizers (listed next to each window), that is choosing
2-grams while sliding a window of length 4 (3 + 2 - 1 = 4) over the text texting. The
results belong to the following set: {ex, in}.


Let us repeat an important property of the minimizers which makes them useful in
practice. If for two strings $S_1$ and $S_2$ it holds that
$P \subset S_1 \land P \subset S_2 \land |P| \geq q + a - 1$,
then it is guaranteed that $S_1$ and $S_2$ share an $(a, q)$-minimizer (because they share one
full window). This means that for certain applications we can still ensure that no exact
matches are overlooked by storing the minimizers rather than all q-grams.
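
The selection procedure can be sketched in Python as shown below. This is a direct,
unoptimized transcription of the definition (each window is rescanned from scratch), and the
function name is hypothetical.

```python
def minimizers(text, a, q):
    """Return the set of (a, q)-minimizers of the text."""
    window_length = q + a - 1
    selected = set()
    for start in range(len(text) - window_length + 1):
        window = text[start:start + window_length]
        # The window contains exactly a consecutive q-grams.
        qgrams = [window[i:i + q] for i in range(a)]
        # min() returns the leftmost of equal minima, which matches the
        # tie-breaking rule mentioned above.
        selected.add(min(qgrams))
    return selected
```

For the text texting with $a = 3$ and $q = 2$ the function returns {ex, in}. The words context
and texting share the substring text of length $q + a - 1 = 4$, and accordingly their minimizer
sets share the 2-gram ex.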


4.4 String sketches 

We introduce the concept of string sketches, whose goal is to speed up string comparisons
at the cost of additional space. For a given string $S$, a sketch $S'$ is constructed as
$S' = f(S)$ using some function $f$ which returns a fixed-size block of data. In particular,
for two strings $S_1$ and $S_2$, we would like to determine with certainty that $S_1 \neq S_2$ or
$Ham(S_1, S_2) > k$ when comparing only sketches $S'_1$ and $S'_2$. There exists a similarity
between sketches and hash functions; however, hash comparison would work only in the
context of exact matching. When the sketch comparison is not decisive, we still have
to perform an explicit verification on $S_1$ and $S_2$, but the sketches allow for reducing
the number of such verifications. Since the sketches refer to individual words, they are
relevant in the context of keyword indexes. Assuming that each word $d \in D$ is stored
along with $d'$, sketches could be especially useful if the queries are known in advance or
$|P|$ is relatively high, since sketch calculation might be time-consuming.

Sketches use individual bits in order to store information about q-gram frequencies in
the string. Various approaches exist, and the main properties of the said q-grams include:

• Size: for instance individual letters (1-grams) are sensible for the English alphabet
but pairs (2-grams) might be better for the DNA.

• Frequency: we can store binary information in each bit that indicates whether a
certain q-gram appears in the string (8 q-grams in total for a 1-byte sketch, we call
this approach an occurrence sketch), or we can store their count (at most 3) using
2 bits per q-gram (4 q-grams in total for a 1-byte sketch, we call this approach a
count sketch).

• Selection: which q-grams should be included in the sketches. These could be
for instance q-grams which occur most commonly in the sample text.

For instance, let us consider an occurrence sketch which is built over the 8 most common
letters of the English alphabet, namely {e, t, a, o, i, n, s, h} (consult Appendix E to see
the frequencies). For the word instance, the 1-byte sketch where each bit corresponds
to one of the letters from the aforementioned set would be as follows: 11101110.
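
Both sketch types can be built with a few lines of Python. The sketch below is only an
illustration: the function names are hypothetical, the bit order (first listed letter in the
most significant bit) is an assumption chosen to reproduce the example above, and the count
sketch tracks the first four of the listed letters merely for demonstration.

```python
COMMON_LETTERS = "etaoinsh"  # the 8 letters listed in the text


def occurrence_sketch(word, letters=COMMON_LETTERS):
    """1 bit per letter: set if the letter occurs in the word."""
    sketch = 0
    for letter in letters:
        sketch = (sketch << 1) | (letter in word)
    return sketch


def count_sketch(word, letters=COMMON_LETTERS[:4]):
    """2 bits per letter: the number of occurrences, capped at 3."""
    sketch = 0
    for letter in letters:
        sketch = (sketch << 2) | min(word.count(letter), 3)
    return sketch
```

Formatting `occurrence_sketch("instance")` as an 8-bit binary number gives 11101110, in
agreement with the example.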

We can quickly compare two sketches by taking a binary xor operation and counting
the number of bits which are set in the result (calculating the Hamming weight, $H_W$).
Note that $H_W$ can be determined in constant time using a lookup table of size $2^{8n}$
bytes, where $n$ is the sketch size in bytes. We denote the sketch difference with $H_S$,
and $H_S(S'_1, S'_2) = H_W(S'_1 \oplus S'_2)$. Let us note that $H_S$ does not determine the number of
mismatches, for instance for $S_1$ = run and $S_2$ = ran, $H_S(S_1, S_2)$ might be equal to 2
(occurrence differences in a and u) but there is still only one mismatch. On the other
extreme, for two strings of length $n$ where each string consists of a repeated occurrence
of one different letter, $H_S$ might be equal to 1, but the number of mismatches is $n$. In
general, $H_S$ can be used to provide a lower bound on the true number of errors. For
sketches which record information about single characters (1-grams), the following holds:
$Ham(S_1, S_2) \geq \lceil H_S(S'_1, S'_2) / 2 \rceil$ (the right-hand side can be calculated quickly using a
lookup table, since $0 \leq H_S \leq 8$). The true number of mismatches is underestimated
especially by count sketches, since we calculate the Hamming weight instead of comparing
the counts. For instance, for the count of 3 (bits 11) and the count of 1 (bits 01), the
difference is 1 instead of 2. Still, even though the true error is higher than $H_S$, sketches
can be used in order to speed up the comparisons because certain strings will be compared
(and rejected) in constant time using fast bitwise operations and array lookups. As
regards the space overhead incurred by the sketches, it is equal to $O(|D| + \sigma)$, since we
have to store one constant-size sketch per word together with the lookup tables which
are used to speed up the processing. Consult Section 5.3 in order to see the experimental
results.
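
The comparison procedure can be sketched in Python as follows, for 1-byte occurrence sketches
only. All names are hypothetical, and the set of tracked letters is an assumption chosen so that
both a and u are covered, which reproduces the run/ran example; in practice the sketches of
dictionary words would be precomputed rather than rebuilt for each comparison.

```python
# Hypothetical choice of 8 tracked letters (it includes both "a" and "u").
LETTERS = "etaoinsu"

# Lookup table with the Hamming weights of all 1-byte values.
WEIGHT = [bin(value).count("1") for value in range(256)]


def occurrence_sketch(word, letters=LETTERS):
    """1 bit per tracked letter, first letter in the most significant bit."""
    sketch = 0
    for letter in letters:
        sketch = (sketch << 1) | (letter in word)
    return sketch


def sketch_difference(sketch1, sketch2):
    """H_S: the Hamming weight of the xor of two sketches."""
    return WEIGHT[sketch1 ^ sketch2]


def mismatch_lower_bound(sketch1, sketch2):
    """Lower bound on the Hamming distance: ceil(H_S / 2)."""
    return (sketch_difference(sketch1, sketch2) + 1) // 2


def hamming(word1, word2):
    """Explicit verification for two words of equal length."""
    return sum(char1 != char2 for char1, char2 in zip(word1, word2))


def within_k_mismatches(word1, word2, k):
    """Reject through sketches when possible, verify explicitly otherwise."""
    bound = mismatch_lower_bound(occurrence_sketch(word1), occurrence_sketch(word2))
    if bound > k:
        return False
    return hamming(word1, word2) <= k
```

With these letters, run and ran differ in two sketch bits while the lower bound on the number
of mismatches is 1, so the pair still has to be verified explicitly; seat and noun differ in
seven bits and are rejected for k = 1 without looking at the words.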



Chapter 5 


Experimental Results 


The results were obtained on a machine with the following specifications:

• Processor — Intel i5-3230M running at 2.6 GHz 

• RAM — 8 GB DDR3 memory 

• Operating system — Ubuntu 14.04 64-bit (kernel version 3.16.0-41) 

Programs were written in the C++ programming language (with certain prototypes in
the Python language) using features from the C++11 standard [cpp11]. They use the
C++ Standard Library, Boost libraries (version 1.57) [Kar05], and Linux system libraries.
Correctness was analyzed using Valgrind [NS03], a tool for error checking and profiling
(no errors or memory leaks were reported). The source code was compiled (as a 32-bit
version) with the clang compiler v. 3.4-1, which turned out to produce a slightly faster
executable than gcc when checked under the -O3 optimization flag.


5.1 FM-bloated 


For the description of the FM-bloated structure consult Subsection 3.1.6. Here, we
present experimental results for the superlinear index version. As regards the hash
function, xxhash was used (available on the Internet, consult Appendix F), and the load
factor was equal to 2.81.


The length of the pattern has a crucial impact on the search time, since the number of
LF-mapping steps is equal to the number of 1s in the binary representation of $m$. This
means that the search will be the fastest for $m$ in the form $2^c$ (constant time for $m$ up
to a certain maximum value) and the slowest for $m$ in the form $2^c - 1$, where $c \in \mathbb{Z}$. We
can see in Figure 5.1 that the query time also generally decreases as the pattern length
increases, mostly due to the fact that the times are given per character. The results are
the average times calculated for one million queries which were extracted from the input
text.
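
The relation between the pattern length and the number of LF-mapping steps can be checked with
a short Python sketch (the function name is hypothetical, and only the count of 1 bits is
modeled, not the search itself):

```python
def lf_mapping_steps(m):
    """Number of 1s in the binary representation of the pattern length m."""
    return bin(m).count("1")


# Pattern lengths around powers of two, as highlighted in the experiments.
STEPS = {m: lf_mapping_steps(m) for m in (31, 32, 127, 128)}
```

The lengths 32 and 128 need a single step, whereas 31 and 127 need 5 and 7 steps, respectively,
which is in line with the differences between these neighboring lengths.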



Figure 5.1: Query time per character vs pattern length (16, 20, 22, 24, 31, 32, 64, 80,
82, 85, 90, 100, 102, 126, 127, and 128) for the English text of size 30 MB. Let us point
out notable differences between pattern lengths 31 and 32, and 127 and 128.


We also compare our approach with other FM-index-based structures (consult Figure 5.2).
We used the implementations from the sdsl library [GBMP14] (available on the Internet
[sds]) and the implementations of FM-dummy structures by Grabowski et al. [GRD15]
(available on the Internet [ran]). As regards the space, the FM-bloated structure (just as
the name suggests) is roughly two orders of magnitude bigger than other
indexes. The index size for other methods ranged from approximately $0.6n$ to $4.25n$,
where $n$ is the input text size. FM-bloated, on the other hand, occupied the amount of
space equal to almost $85n$ (for $q_{max} = 128$).


5.2 Split index 


In this section we present the results which appeared in a preprint by Cisłak (thesis
author) and Grabowski [CG15]. For the description of the split index consult
Subsection 4.2.3.


Figure 5.2: Query time per character vs pattern length (16, 24, 64, 80, and 120) for
different methods for the English text of size 30 MB. Note the logarithmic y-scale.


One of the crucial components of the split index is a hash function. Ideally, we would
like to minimize the average length of the bucket (let us recall that we use chaining for
collision resolution); however, the hash function should also be relatively fast because
it has to be calculated for each of the $k + 1$ parts of the pattern (of total length $m$). We
investigated various hash functions, and it turned out that the differences in query times
are not negligible, although the average length of the bucket was almost the same in
all cases (relative differences were smaller than 1%). We can see in Table 5.1 that the
fastest function was xxhash (available on the Internet, consult Appendix F), and for
this reason it was used for the calculation of other results.


Hash function    Query time (μs)
xxhash           0.93
sdbm             0.95
FNV1             0.95
FNV1a            0.95
SuperFast        0.96
Murmur3          0.97
City             0.99
FARSH            1.00
Spooky V2        1.04
Farm             1.04


Table 5.1: Evaluated hash functions and search times per query for the English
dictionary of size 2.67 MB and k = 1. A list of common English misspellings was used
as queries, max LF = 2.0.

Decreasing the value of the load factor did not strictly provide a speedup in terms of the
query time, as demonstrated in Figure 5.3. This can be explained by the fact that even
though the relative reduction in the number of collisions was substantial, the absolute
difference was equal to at most a few collisions per list. Moreover, when the LF was
higher, pointers to the lists could possibly be closer to each other, which might have had
a positive effect on cache utilization. The best query time was reported for the maximum
LF value of 2.0, hence this value was used for the calculation of other results.




Figure 5.3: Query time and index size vs the load factor for the English dictionary
of size 2.67 MB and k = 1. A list of common English misspellings was used as queries.
The value of LF can be higher than 1.0 because we use chaining for collision resolution.


In Table 5.2 we can see a linear increase in the index size and an exponential increase
in query time with growing k. Even though we concentrate on k = 1 and the most
promising results are reported for this case, our index might remain competitive also for
higher k values.


k    Query time (μs)    Index size (KB)
1    0.51               1,715
2    11.49              2,248
3    62.85              3,078


Table 5.2: Query time and index size vs the error value k for the English language
dictionary of size 0.79 MB. A list of common English misspellings was used as queries.


Q-gram substitution coding provided a reduction in the index size, at the cost of increased
query time. Q-grams were generated separately for each dictionary $D$ as a list of 100
q-grams which provided the best compression for $D$, i.e. they minimized the size of
all encoded words, $S_E = \sum_{i=1}^{|D|} |Enc(d_i)|$. For the English language dictionaries, we
also considered using only 2-grams or only 3-grams, and for the DNA only 2-grams (a
maximum of 25 2-grams) and 4-grams, since mixing the q-grams of various sizes has a
further negative impact on the query time. For the DNA, 5,000 queries were generated
randomly by introducing noise into words sampled from the dictionary, and their length was
equal to the length of the particular word. Up to 3 errors were inserted, each with a 50%
probability. For the English dictionaries we opted for the list of common misspellings,
and the results were similar to the case of randomly generated queries. The evaluation
was run 100 times and the results were averaged.


We can see the speed-to-space relation for the English dictionaries in Figure 5.4 and for
the DNA in Figure 5.5. In the case of English, using the optimal (from the compression
point of view, i.e. minimizing the index size) combination of mixed q-grams provided
almost the same index size as using only 2-grams. Substitution coding methods performed
better for the DNA (where $\sigma = 5$) because the sequences are more repetitive. Let us note
that the compression provided a higher relative decrease in index size with respect to
the original text as the size of the dictionary increased. For instance, for the dictionary
of size 627.8 MB the compression ratio was equal to 1.93 and the query time was still
around 100 μs. Consult Appendix C for more information about the compression.



Figure 5.4: Query time and index size vs dictionary size for k = 1, with and without
q-gram coding. Mixed q-grams refer to the combination of q-grams which provided the
best compression, and for the three dictionaries these were equal to ([2-, 3-, 4-] grams):
[88, 8, 4], [96, 2, 2], and [94, 4, 2], respectively. English language dictionaries and the
list of common English misspellings were used.















Figure 5.5: Query time and index size vs dictionary size for k = 1, with and without
q-gram coding. Mixed q-grams refer to the combination of q-grams which provided
the best compression, and these were equal to ([2-, 3-, 4-] grams): [16, 66, 18] (due to
computational constraints, they were calculated only for the first dictionary, but used
for all four dictionaries). DNA dictionaries and the randomly generated queries were
used.


When tested on the English language dictionaries, our index yielded promising results in
comparison with methods proposed by other authors. Others consider the Levenshtein distance
as the edit distance, whereas we use the Hamming distance, which puts us in an
advantageous position. Still, the provided speedup is significant, and we believe that the
more restrictive Hamming distance is also an important measure of practical use (see
Subsection 2.1.1 for more information). The implementations of other authors are
available on the Internet [boy, che]. As regards the results reported for the MF and Boytsov's
Reduced alphabet neighborhood generation, it was not possible to accurately calculate
the size of the index (both implementations by Boytsov), and for this reason we used
rough ratios based on index sizes reported by Boytsov for similar dictionary sizes. Let
us note that we compare our algorithm with Chegrane and Belazzougui [CB14], who
published better results when compared to Karch et al. [KBS10], who in turn claimed
to be faster than other state-of-the-art methods. We have not managed to identify any
practice-oriented indexes for matching in dictionaries over any fixed alphabet $\Sigma$
dedicated for the Hamming distance, which could be directly compared to our split index.
The times for the brute-force algorithm are not listed, since they were roughly 3 orders
of magnitude higher than the ones presented. Consult Figure 5.6 for details.


Figure 5.6: Query time vs index size for different methods. The method with
compression encoded mixed q-grams. We used the Hamming distance, and the other authors
used the Levenshtein distance for k = 1. English language dictionaries of size 0.79 MB,
2.67 MB, and 5.8 MB were used as input, and the list of common misspellings was used
for queries.


We also evaluated different word splitting schemes. For instance for k = 1, one could
split the word into two parts of different sizes, e.g., 6 → (2, 4) instead of 6 → (3, 3);
however, unequal splitting methods caused slower queries when compared to the regular one.
As regards Hamming distance calculation, it turned out that the naive implementation
(i.e. simply iterating and comparing each character) was the fastest one. The
compiler with automatic optimization was simply more efficient than other implementations
(e.g., ones based directly on SSE instructions) that we have investigated.


5.3 String sketches 


String sketches which were introduced in Section 4.4 allow for faster string comparison,
since in certain cases we can deduce for two strings $S_1$ and $S_2$ that $D(S_1, S_2) > k$ for some
$k$ without performing an explicit verification. In our implementation, a sketch comparison
requires performing one bitwise operation and one array lookup, i.e. 2 constant-time operations
in total. We analyze the comparison time between two strings using various sketch
types versus an explicit verification. The sketch is calculated once per query and it is
then reused for the comparison with consecutive words, i.e. we examine the situation
where a single query is compared against a dictionary of words. The dictionary size
for which a speedup was reported was around 100 words or more, since in the case of
fewer words sketch construction was too slow in relation to the comparisons. When the
sketch comparison was not decisive, a verification was performed and it contributed to the
elapsed time. The words were generated over the English alphabet (consult Appendix E
in order to see letter frequencies), and each sketch occupied 2 bytes (1-byte sketches were
not effective). Figures 5.7 and 5.8 contain the results for occurrence and count sketches,
respectively. Consult Appendix D for more information regarding the letter distribution
in the alphabet.



Figure 5.7: Comparison time vs word size for 1 mismatch using occurrence sketches
for words generated over the English alphabet. Each sketch occupies 2 bytes, and time
refers to average comparison time between a pair of words. Common sketches use 16
most common letters, rare sketches use 16 least common letters, and mixed sketches
use 8 most common and 8 least common letters. Note the logarithmic x-scale.


Figure 5.8: Comparison time vs word size for 1 mismatch using count sketches for
words generated over the English alphabet. Each sketch occupies 2 bytes, and time
refers to average comparison time between a pair of words. Common sketches use 16
most common letters, rare sketches use 16 least common letters, and mixed sketches
use 8 most common and 8 least common letters. Note the logarithmic x-scale.








Chapter 6 


Conclusions 


String searching algorithms are ubiquitous in computer science. They are used for
common tasks performed on home PCs such as searching inside text documents or spell
checking as well as for industrial projects, e.g., genome sequencing. Strings can be
defined very broadly, and they usually contain natural language and biological data (DNA,
proteins), but they can also represent various kinds of data such as music or images.
An interesting aspect of string matching is the diversity and complexity of the solutions
which have been presented over the years (both theoretical and practical), despite the
simplicity of problem formulation (one of the most common ones being “check if pattern
P exists in text T”).

We investigated string searching methods which preprocess the input text and construct
a data structure called an index. This makes it possible to reduce the time required for
searching, and it is often indispensable when it comes to massive sizes of modern data sets.
The indexes are divided into full-text ones which operate on the whole input text and can
answer arbitrary queries, and keyword indexes which store a dictionary of individual
q-grams (these can correspond to, e.g., words in a natural language dictionary or DNA
reads).

Key contributions include the structure called FM-bloated, which is a modification of
the FM-index (a compressed, full-text index) that trades space for speed. Two variants
of the FM-bloated were described: one using $O(n \log^2 n)$ bits of space with
$O(m + \log m \log \log n)$ average query time, and one with linear space and $O(m \log \log n)$
average query time, where $n$ is the input text length and $m$ is the pattern length. We
experimentally show that by operating on q-grams in addition to individual characters a
significant speedup can be achieved (albeit at the cost of very high space requirements,
hence the name “bloated”).

The split index is a keyword index for the k-mismatches problem with a focus on the
1-error case. It performed better than other solutions for the Hamming distance, and
times in the order of 1 microsecond were reported for one mismatch for a few-megabyte
natural language dictionary on a medium-end PC. A minor contribution includes string
sketches which aim to speed up approximate string comparison at the cost of additional
space ($O(1)$ per string).


6.1 Future work 


We presented results for the superlinear variant of the FM-bloated index in order to
demonstrate its potential and capabilities. Multiple modifications and implementations
of this data structure can be introduced. Let us recall that we store the count table and
occurrence lists for selected q-grams in addition to individual characters from the regular
FM-index. This q-gram selection process can be fine-tuned: the more q-grams we store,
the faster the search should be, but the index size grows as well. For instance, the linear
space version could be augmented with additional 1-, 2-, etc. q-grams which start at the
position of each minimizer, up to an s-gram where $s$ is the maximum gap size between
two minimizers. This would eliminate two phases of the search (for prefixes and suffixes,
cf. Subsection 3.1.6.2) where individual characters have to be used for the LF-mapping
mechanism. Moreover, the comparison with other methods could be augmented with an
inverted index on q-grams, whose properties should be more similar to FM-bloated than
those of FM-index variants, especially when it comes to space requirements.

As regards the split index, we describe possible extensions in Subsections 4.2.3.3 and
4.2.3.4. These include using multiple threads and introducing the functionality of an
inverted index on q-grams. Moreover, the algorithm could possibly be extended to handle
the Levenshtein distance as well, although this would certainly have a substantial impact
on space usage. Another desired functionality could include a dedicated support for a
binary alphabet ($\sigma = 2$). In such a case, individual characters could be stored with bits,
which should have a positive effect on cache usage thanks to further data compaction
and possibly an alignment with the cache line size.







Appendix A 


Data Sets 


The following tables present information regarding the data sets that were used in
this work. Table A.1 describes data sets from the popular Pizza & Chili (P&C)
corpus [FGNV09, piz], which were used for full-text indexes (English.30 was extracted
from English.50). Table A.2 describes data sets which were used for keyword indexes.
The English dictionaries come from Linux packages and the webpage by Foster [fos],
and the list of common misspellings which were used as queries was obtained from
Wikipedia. The DNA dictionaries contain 20-mers which were extracted from the
genome of Drosophila melanogaster that was collected from the FlyBase database [fly].
The provided sizes refer to the size of the dictionary after preprocessing: for keyword
indexes, duplicates as well as delimiters (usually newline characters) are removed. The
abbreviation NL refers to natural language.


Name           Source    Type            Size
English.30     P&C       NL (English)    30 MB
English.50     P&C       NL (English)    50 MB
English.200    P&C       NL (English)    200 MB


Table A.1: A summary of data sets which were used for the experimental evaluation
of full-text indexes.


Name                Source           Type            Size
iamerican           Linux package    NL (English)    0.79 MB
foster              Foster           NL (English)    2.67 MB
iamerican-insane    Linux package    NL (English)    5.8 MB
misspellings        Wikipedia        NL (English)    42.2 KB (4,261 words)
dmel-tiny           FlyBase          DNA             6.01 MB
dmel-small          FlyBase          DNA             135.89 MB
dmel-medium         FlyBase          DNA             262.78 MB
dmel-big            FlyBase          DNA             627.80 MB


Table A.2: A summary of data sets which were used for the experimental evaluation
of keyword indexes.





Appendix B 


Exact Matching Complexity 


In the theoretical analysis we often mention exact string comparison, i.e. determining
whether $S_1 = S_2$. It must hold that $|S_1| = |S_2|$, and the worst-case complexity of this
operation is equal to $O(n)$ (all characters have to be compared when the two strings
match). On the other hand, the average complexity depends on the alphabet $\Sigma$. If, for
instance, $\sigma = 2$, we have 1/2 probability that characters $S_1[0]$ and $S_2[0]$ match, 1/4 that
characters $S_1[1]$ and $S_2[1]$ match as well, etc., in the case of uniform letter frequencies.
More generally, the probability that there is a match between all characters up to a
0-based position $i$ is equal to $1/\sigma^{i+1}$, and the average number of required comparisons $A_c$
is equal to $1 + 1/(\sigma - 1)$ for any $\sigma \geq 2$. We can derive the following relation:
$\lim_{\sigma \to \infty} A_c = 1$,
and hence treating the average time required for exact comparison of two random strings
from the same alphabet $\Sigma$ as $O(1)$ is justified for any $\sigma$. In Figure B.1 we present the
relation between the average number of comparisons and the $\sigma$ value.
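
The closed form can be checked numerically with a short Python sketch. It assumes uniform
letter frequencies and long strings, and the function name and the `max_length` parameter are
hypothetical: the first comparison is always made, and each further comparison is made only if
all previous characters matched.

```python
def average_comparisons(sigma, max_length=64):
    """Expected number of character comparisons for two random strings."""
    expected = 1.0       # the first pair of characters is always compared
    prefix_match = 1.0   # probability that all characters so far matched
    for _ in range(max_length - 1):
        prefix_match /= sigma
        expected += prefix_match
    return expected
```

For $\sigma = 2$ the result is 2 and for $\sigma = 26$ it is 1.04, in agreement with
$1 + 1/(\sigma - 1)$.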


In the case of real-world $\Sigma$ such as the English language alphabet, context information
in the form of $k$-th order entropy should be taken into account. In a simplified analysis,
let us consider the frequencies from Appendix E: the probability that two characters
sampled at random match is equal to $0.127^2$ (for e) $+ 0.091^2$ (for t), etc. Proceeding in
this manner, the probability for the match between the first pair of characters is equal to
6.55%, for the first and the second pair 0.43%, etc. As regards an empirical evaluation
on the English.200 text, the average number of comparisons between a random pair of
strings was equal to approximately 1.075.


Figure B.1: Average number of character comparisons when comparing two random
strings (for exact matching) from the same alphabet with uniform letter frequency vs
the alphabet size $\sigma$.




Appendix C 


Split Index Compression 


This appendix presents additional information regarding the q-gram-based compression
of the split index (consult Subsection 4.2.3 for the description of this data structure and
Section 5.2 for the experimental results). In Figures C.1 and C.2 we can see the relation
between the index size and the selection of 100 2-grams and 3-grams for the English
alphabet (where the 2-grams clearly provided a better compression) and 100 3-grams
and 4-grams for the DNA.
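To make the trade-off between the two q-gram lengths concrete, the following sketch shows one possible q-gram substitution scheme. It is an illustration under stated assumptions and not the encoding from Subsection 4.2.3: we assume that the most frequent q-grams are selected, that the words are ASCII, that each selected q-gram is replaced by one otherwise unused byte value (128 and above), and that encoding is greedy with longer q-grams tried first. All function names are hypothetical.

```python
from collections import Counter


def most_common_qgrams(words, q, count):
    """Return the `count` most frequent q-grams over all words."""
    counter = Counter()
    for word in words:
        for i in range(len(word) - q + 1):
            counter[word[i:i + q]] += 1
    return [gram for gram, _ in counter.most_common(count)]


def build_codebook(words, num_2grams, num_3grams):
    """Map the selected q-grams to byte codes 128, 129, ... (at most 128 codes)."""
    grams = most_common_qgrams(words, 3, num_3grams)
    grams += most_common_qgrams(words, 2, num_2grams)
    if len(grams) > 128:
        raise ValueError("at most 128 q-grams fit into the unused byte values")
    return {gram: 128 + i for i, gram in enumerate(grams)}


def encode(word, codebook):
    """Greedy encoding: try a 3-gram, then a 2-gram, else emit the raw character."""
    out = bytearray()
    i = 0
    while i < len(word):
        for q in (3, 2):
            gram = word[i:i + q]
            if len(gram) == q and gram in codebook:
                out.append(codebook[gram])
                i += q
                break
        else:
            out.append(ord(word[i]))  # ASCII character, always below 128
            i += 1
    return bytes(out)


def decode(data, codebook):
    """Invert encode() by expanding every code back into its q-gram."""
    reverse = {code: gram for gram, code in codebook.items()}
    return "".join(reverse.get(byte, chr(byte)) for byte in data)


if __name__ == "__main__":
    words = ["banana", "bandana", "cabana"]
    codebook = build_codebook(words, num_2grams=2, num_3grams=2)
    for word in words:
        print(word, len(word), "->", len(encode(word, codebook)), "bytes")
```

Varying `num_2grams` against `num_3grams` while keeping their sum fixed corresponds to moving along the x-axis of Figures C.1 and C.2.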



Figure C.1: Index size vs the number of 2-grams used for the compression for the
English dictionary. 100 q-grams were used, and the remaining q-grams were 3-grams.

Figure C.2: Index size vs the number of 3-grams used for the compression for the
DNA dictionary. 100 q-grams were used, and the remaining q-grams were 4-grams.



Appendix D 


String Sketches 


In Section 5.3 we discussed the use of string sketches for the English alphabet, where
we could take advantage of the varying letter frequency. Here, we present the results for
the alphabet with uniform distribution and $\sigma = 26$. Instead of selecting the most or the
least common letters, the 2-byte sketches contain information regarding 16 (occurrence)
or 8 (count) randomly selected letters. We can see in Figure D.1 that in this case the
sketches do not provide the desired speedup.
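As a reminder of how such 2-byte sketches can act as a filter, consider the sketch below. It is an illustration under assumptions rather than our C++ implementation: we assume that an occurrence sketch stores one bit per tracked letter, that a count sketch stores a 2-bit counter per tracked letter saturated at 3, and that a pair of equal-length words is rejected when the sketches differ by more than 2k units (a single mismatch can change the information for at most two letters). All names are hypothetical.

```python
import random
import string


def occurrence_sketch(word, letters):
    """One bit per tracked letter: bit i is set if letters[i] occurs in word."""
    sketch = 0
    for i, letter in enumerate(letters):
        if letter in word:
            sketch |= 1 << i
    return sketch


def count_sketch(word, letters):
    """Two bits per tracked letter: number of occurrences, saturated at 3."""
    sketch = 0
    for i, letter in enumerate(letters):
        sketch |= min(word.count(letter), 3) << (2 * i)
    return sketch


def may_match_occurrence(sketch_a, sketch_b, k):
    """Return False only if the words certainly differ in more than k positions."""
    return bin(sketch_a ^ sketch_b).count("1") <= 2 * k


def may_match_count(sketch_a, sketch_b, k, tracked=8):
    """Same filter for count sketches, summing per-letter counter differences."""
    diff = 0
    for i in range(tracked):
        a = (sketch_a >> (2 * i)) & 3
        b = (sketch_b >> (2 * i)) & 3
        diff += abs(a - b)
    return diff <= 2 * k


if __name__ == "__main__":
    rng = random.Random(7)
    occ_letters = rng.sample(string.ascii_lowercase, 16)  # 16 bits in total
    cnt_letters = rng.sample(string.ascii_lowercase, 8)   # 8 x 2 bits in total
    a, b = "hello", "hallo"
    print(may_match_occurrence(occurrence_sketch(a, occ_letters),
                               occurrence_sketch(b, occ_letters), k=1))
    print(may_match_count(count_sketch(a, cnt_letters),
                          count_sketch(b, cnt_letters), k=1))
```

The filter never rejects a pair within k mismatches, but it can only save time if it rejects many non-matching pairs. With uniformly distributed letters, randomly chosen tracked letters carry little discriminating information, which is consistent with the lack of speedup in Figure D.1.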



Figure D.1: Comparison time vs word size for 1 mismatch using various string sketches
generated over the alphabet with uniform letter frequency and $\sigma = 26$. Each sketch
occupies 2 bytes, and time refers to average comparison time between a pair of words.
Note the logarithmic x-scale.
Appendix E


English Letter Frequency 


Frequencies presented in Table E.1 [Lew00, p. 36] were used for the generation of random
queries where the letter distribution corresponded to the real-world English use.


Letter   Frequency   Letter   Frequency
e        12.702%     m        2.406%
t         9.056%     w        2.361%
a         8.167%     f        2.228%
o         7.507%     g        2.015%
i         6.966%     y        1.974%
n         6.749%     p        1.929%
s         6.327%     b        1.492%
h         6.094%     v        0.978%
r         5.987%     k        0.772%
d         4.253%     j        0.153%
l         4.025%     x        0.150%
c         2.782%     q        0.095%
u         2.758%     z        0.074%


Table E.1: Frequencies of English alphabet letters.
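Drawing a random query from this distribution can be sketched as follows. The snippet is illustrative only: the function name, the query length, and the seed are assumptions, and the actual generator used in the experiments may differ.

```python
import random

# Letter frequencies in percent, copied from Table E.1.
ENGLISH_FREQ = {
    "e": 12.702, "t": 9.056, "a": 8.167, "o": 7.507, "i": 6.966,
    "n": 6.749, "s": 6.327, "h": 6.094, "r": 5.987, "d": 4.253,
    "l": 4.025, "c": 2.782, "u": 2.758, "m": 2.406, "w": 2.361,
    "f": 2.228, "g": 2.015, "y": 1.974, "p": 1.929, "b": 1.492,
    "v": 0.978, "k": 0.772, "j": 0.153, "x": 0.150, "q": 0.095,
    "z": 0.074,
}


def random_query(length, rng):
    """Draw `length` letters independently according to ENGLISH_FREQ."""
    letters = list(ENGLISH_FREQ)
    weights = list(ENGLISH_FREQ.values())
    return "".join(rng.choices(letters, weights=weights, k=length))


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so that runs are repeatable
    for _ in range(5):
        print(random_query(8, rng))
```

Letters are sampled independently here, so the generated queries follow the single-letter frequencies but not the higher-order structure of English words.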
Appendix F


Hash Functions 


Table F.1 contains Internet addresses of hash functions which were used to obtain
experimental results for the split index (Section 5.2). If the hash function is not listed, it
means that our own implementation was used.


Name        Address
City        https://code.google.com/p/cityhash/
Farm        https://code.google.com/p/farmhash/
FARSH       https://github.com/Bulat-Ziganshin/FARSH
Murmur3     https://code.google.com/p/smhasher/wiki/MurmurHash3
Spooky V2   http://burtleburtle.net/bob/hash/spooky.html
SuperFast   http://www.azillionmonkeys.com/qed/hash.html
xxhash      https://code.google.com/p/xxhash/


Table F.1: A summary of Internet addresses of hash functions.
List of Symbols


B block — either an I/O unit or a fixed-size piece of a bit vector 
c character (a string of length 1) 

C count table in the FM-index 
Cl cache line size 
D distance metric 

S time required for calculating $D(S_1, S_2)$ for two strings over the same alphabet
$\mathcal{D}$ dictionary of keywords (for keyword indexes)
d word (string) from a dictionary 
Enc(d) encoded (compressed) word d
F first column of the BWT matrix 
H hash function 

$H_k(S)$ k-th order entropy of string S
Ht hash table 

Hw Hamming weight (number of 1s in a bit vector)

Ham Hamming distance 

[i] number i rounded to the nearest integer

I index for string matching 